Image segmentation and object detection using fully convolutional neural network

ABSTRACT

This disclosure relates to digital image segmentation, region of interest identification, and object recognition. This disclosure describes a method and a system for image segmentation based on a fully convolutional neural network including an expansion neural network and a contraction neural network. The various convolutional and deconvolutional layers of the neural networks are architected to include a coarse-to-fine residual learning module and learning paths, as well as a dense convolution module, to extract auto-context features and to facilitate fast, efficient, and accurate training of the neural networks capable of producing prediction masks of regions of interest. While the disclosed method and system are applicable to general image segmentation and object detection/identification, they are particularly suitable for organ, tissue, and lesion segmentation and detection in medical images.

RELATED APPLICATION

This application is a continuation application of U.S. patent application Ser. No. 16/104,449 filed on Aug. 17, 2018, which is incorporated by reference in its entirety.

TECHNICAL FIELD

This disclosure relates generally to computer segmentation and object detection in digital images and particularly to segmentation of multi-dimensional and multi-channel medical images and to detection of lesions based on convolutional neural networks.

BACKGROUND

A digital image may contain one or more regions of interest (ROIs). In many applications, only image data contained within the one or more ROIs of a digital image may need to be retained for further processing and for information extraction by computers. Efficient and accurate identification of these ROIs thus constitutes a critical step in image processing applications, including but not limited to applications that handle high-volume and/or real-time digital images. Each ROI of a digital image may contain pixels forming patches with drastic variation in texture and pattern, making accurate and efficient identification of the boundary between these ROIs and the rest of the digital image a challenging task for a computer. In some image processing applications, an entire ROI or a subsection of an ROI may need to be further identified and classified. For example, in the field of computer-aided medical image analysis and diagnosis, an ROI in a medical image may correspond to a particular organ of a human body, and the organ region of the image may need to be further processed to identify, e.g., lesions within the organ, and to determine the nature of the identified lesions.

SUMMARY

This disclosure is directed to an enhanced convolutional neural network including a contraction neural network and an expansion neural network. These neural networks are connected in tandem and are enhanced using a coarse-to-fine architecture and a densely connected convolutional module to extract auto-context features for more accurate and more efficient segmentation and object detection in digital images.

The present disclosure describes a method for image segmentation. The method includes receiving, by a computer comprising a memory storing instructions and a processor in communication with the memory, a set of training images labeled with a corresponding set of ground truth segmentation masks. The method includes establishing, by the computer, a fully convolutional neural network comprising a multi-layer contraction convolutional neural network and an expansion convolutional neural network connected in tandem. The method includes iteratively training, by the computer, the fully convolutional neural network in an end-to-end manner using the set of training images and the corresponding set of ground truth segmentation masks by down-sampling, by the computer, a training image of the set of training images through the multi-layer contraction convolutional neural network to generate an intermediate feature map, wherein a resolution of the intermediate feature map is lower than a resolution of the training image; up-sampling, by the computer, the intermediate feature map through the multi-layer expansion convolutional neural network to generate a first feature map and a second feature map, wherein a resolution of the second feature map is larger than the resolution of the first feature map; generating, by the computer based on the first feature map and the second feature map, a predictive segmentation mask for the training image; generating, by the computer based on a loss function, an end loss based on a difference between the predictive segmentation mask and a ground truth segmentation mask corresponding to the training image; back-propagating, by the computer, the end loss through the fully convolutional neural network; and minimizing, by the computer, the end loss by adjusting a set of training parameters of the fully convolutional neural network using gradient descent.

The present disclosure also describes a computer image segmentation system for digital images. The computer image segmentation system for digital images includes a communication interface circuitry; a database; a predictive model repository; and a processing circuitry in communication with the database and the predictive model repository. The processing circuitry is configured to receive a set of training images labeled with a corresponding set of ground truth segmentation masks. The processing circuitry is configured to establish a fully convolutional neural network comprising a multi-layer contraction convolutional neural network and an expansion convolutional neural network connected in tandem. The processing circuitry is configured to iteratively train the fully convolutional neural network in an end-to-end manner using the set of training images and the corresponding set of ground truth segmentation masks by configuring the processing circuitry to: down-sample a training image of the set of training images through the multi-layer contraction convolutional neural network to generate an intermediate feature map, wherein a resolution of the intermediate feature map is lower than a resolution of the training image; up-sample the intermediate feature map through the multi-layer expansion convolutional neural network to generate a first feature map and a second feature map, wherein a resolution of the second feature map is larger than the resolution of the first feature map; generate, based on the first feature map and the second feature map, a predictive segmentation mask for the training image; generate, based on a loss function, an end loss based on a difference between the predictive segmentation mask and a ground truth segmentation mask corresponding to the training image; back-propagate the end loss through the fully convolutional neural network; and minimize the end loss by adjusting a set of training parameters of the fully convolutional neural network using gradient descent.

The present disclosure also describes a non-transitory computer readable storage medium storing instructions. The instructions, when executed by a processor, cause the processor to receive a set of training images labeled with a corresponding set of ground truth segmentation masks. The instructions, when executed by the processor, cause the processor to establish a fully convolutional neural network comprising a multi-layer contraction convolutional neural network and an expansion convolutional neural network connected in tandem. The instructions, when executed by the processor, cause the processor to iteratively train the fully convolutional neural network in an end-to-end manner using the set of training images and the corresponding set of ground truth segmentation masks by causing the processor to: down-sample a training image of the set of training images through the multi-layer contraction convolutional neural network to generate an intermediate feature map, wherein a resolution of the intermediate feature map is lower than a resolution of the training image; up-sample the intermediate feature map through the multi-layer expansion convolutional neural network to generate a first feature map and a second feature map, wherein a resolution of the second feature map is larger than the resolution of the first feature map; generate, based on the first feature map and the second feature map, a predictive segmentation mask for the training image; generate, based on a loss function, an end loss based on a difference between the predictive segmentation mask and a ground truth segmentation mask corresponding to the training image; back-propagate the end loss through the fully convolutional neural network; and minimize the end loss by adjusting a set of training parameters of the fully convolutional neural network using gradient descent.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a general data/logic flow of various fully convolutional neural networks (FCNs) for implementing image segmentation and object detection.

FIG. 2 illustrates an exemplary general implementation and data/logic flows of the fully convolutional neural network of FIG. 1.

FIG. 3 illustrates an exemplary implementation and data/logic flows of the fully convolutional neural network of FIG. 1.

FIG. 4 illustrates another exemplary implementation and data/logic flows of the fully convolutional neural network of FIG. 1.

FIG. 5 illustrates an exemplary implementation and data/logic flows of the fully convolutional neural network enhanced by a coarse-to-fine architecture.

FIG. 6A illustrates an exemplary implementation and data/logic flows of the fully convolutional neural network having a coarse-to-fine architecture with auxiliary segmentation masks.

FIG. 6B illustrates an exemplary implementation and data/logic flows of the enhanced fully convolutional neural network of FIG. 6A that combines the auxiliary segmentation masks by concatenation.

FIG. 6C illustrates an exemplary implementation and data/logic flows of the enhanced fully convolutional neural network of FIG. 6B with further improvement.

FIG. 6D illustrates an exemplary implementation and data/logic flows of the enhanced fully convolutional neural network of FIG. 6A that combines the auxiliary segmentation masks by summation.

FIG. 6E illustrates an exemplary implementation and data/logic flows of the enhanced fully convolutional neural network of FIG. 6D with further improvement.

FIG. 6F illustrates an exemplary implementation and data/logic flows of the enhanced fully convolutional neural network of FIG. 6B as stage I, further including a dense-convolution (DenseConv) module in a stage II.

FIG. 6G illustrates an exemplary implementation and data/logic flows of the enhanced fully convolutional neural network of FIG. 6F with further improvement.

FIG. 7 illustrates an exemplary implementation and data/logic flows of the DenseConv module of FIG. 6F.

FIG. 8 illustrates a flow diagram of a method for training the exemplary implementation in FIG. 6F or 6G.

FIG. 9 illustrates an exemplary implementation and data/logic flows of combining various fully convolutional neural networks.

FIG. 10 shows an exemplary computer platform for segmenting digital images.

FIG. 11 illustrates a computer system that may be used to implement various computing components and functionalities of the computer platform of FIG. 10.

DETAILED DESCRIPTION

Introduction

A digital image may contain one or more regions of interest (ROIs). An ROI may include a particular type of object. In many applications, only image data within the ROIs contains useful or relevant information. As such, recognition of ROIs in a digital image and identification of boundaries for these ROIs using computer vision often constitute a critical first step before further image processing is performed. A digital image may contain multiple ROIs of a same type or may contain ROIs of different types. For example, a digital image may contain only human faces or may contain both human faces and other objects of interest. Identification of ROIs in a digital image is often alternatively referred to as image segmentation. The term “digital image” may be alternatively referred to as “image”.

An image may be a two-dimensional (2D) image. A 2D image includes pixels having two-dimensional coordinates, which may be denoted along an x-axis and a y-axis. The two-dimensional coordinates of the pixels may correspond to a spatial 2D surface. The spatial 2D surface may be a planar surface or a curved surface projected from a three-dimensional object.

An image may have multiple channels. The multiple channels may be different chromatic channels, for example and not limited to, red-green-blue (RGB) color channels. The multiple channels may be different modality channels for a same object, representing images of the same object taken under different imaging conditions. For example, in conventional photography, different modalities may correspond to different combinations of focus, aperture, exposure parameters, and the like. For another example, in medical images based on Magnetic Resonance Imaging (MRI), different modality channels may include but are not limited to T2-weighted imaging (T2 W), diffusion weighted imaging (DWI), apparent diffusion coefficient (ADC), and K-trans channels.

An image may be a three-dimensional (3D) image. A 3D image includes pixels having three-dimensional coordinates, which may be denoted along an x-axis, a y-axis, and a z-axis. The three-dimensional coordinates of the pixels may correspond to a spatial 3D space. For example, MRI images in each modality channel may be three dimensional, including a plurality of slices of 2D images.

A 3D image may also have multiple channels, effectively forming a four-dimensional (4D) image. The 4D image including multiple channels may be referred to as a pseudo 4D image. A 4D image includes pixels having four-dimensional coordinates, which may be denoted along an x-axis, a y-axis, a z-axis, and a channel number.
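
As a concrete illustration (not part of the original disclosure), such a pseudo-4D multi-channel 3D image can be held as a four-dimensional array; the channel count and spatial sizes below are assumed for illustration only:

    import numpy as np

    # Hypothetical pseudo-4D MR volume: 4 modality channels (e.g., T2 W, DWI, ADC, K-trans),
    # each a 3D volume of 128 x 128 x 24 voxels (x, y, z).
    num_channels, nx, ny, nz = 4, 128, 128, 24
    image_4d = np.zeros((num_channels, nx, ny, nz), dtype=np.float32)

    # A single voxel is addressed by (channel, x, y, z).
    print(image_4d.shape)  # (4, 128, 128, 24)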

ROIs for an image, once determined, may be represented by a digital mask containing a same number of pixels as the digital image or a down-sized number of pixels from the digital image. A digital mask may be alternatively referred to as a mask or a segmentation mask. Each pixel of the mask may contain a value used to denote whether a particular corresponding pixel of the digital image is within any ROI, and if it is, which type of ROI among multiple types of ROIs it falls within. For example, if there is only a single type of ROI, a binary mask is sufficient to represent all ROIs. In particular, each pixel of the ROI mask may be either zero or one, representing whether the pixel is or is not within the ROIs. For a mask capable of representing multiple types of ROIs, each pixel may take one of a number of values each corresponding to one type of ROI. A multi-value mask, however, may be decomposed into a combination of the more fundamental binary masks, each for one type of ROI.
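
The decomposition of a multi-value mask into per-type binary masks can be sketched as follows (a minimal illustration with assumed label values, not taken from the disclosure):

    import numpy as np

    # Hypothetical multi-value mask: 0 = background, 1 = ROI type A, 2 = ROI type B.
    multi_value_mask = np.array([[0, 1, 1],
                                 [2, 2, 0],
                                 [0, 1, 2]])

    # Decompose into one binary mask per ROI type.
    binary_masks = {label: (multi_value_mask == label).astype(np.uint8)
                    for label in (1, 2)}

    print(binary_masks[1])  # 1 where ROI type A, 0 elsewhere
    print(binary_masks[2])  # 1 where ROI type B, 0 elsewhere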

For a 2D image, its mask may correspondingly be 2D, including pixels having two-dimensional coordinates. When an image includes multiple channels, its mask may nevertheless be a single combined mask, wherein the single mask corresponds to all the channels in the image. In some other embodiments, the mask of a multi-channel image may be a multi-channel mask, wherein each of the multiple channels of the mask corresponds to one or more channels of the multi-channel image.

For a 3D image, its mask may correspondingly be a 3D mask, including pixels having three-dimensional coordinates along an x-axis, a y-axis, and a z-axis. When a 3D image includes multiple channels, its mask may be a three-dimensional mask having either a single channel or multiple channels, similar to the 2D mask described above.

ROI masks are particularly useful for further processing of the digital image. For example, an ROI mask can be used as a filter to determine a subset of image data that is within particular types of ROIs and that needs to be further analyzed and processed. Image data outside of these particular types of ROIs may be removed from further analysis. Reducing the amount of data that needs to be further processed may be advantageous in situations where processing speed is essential and memory space is limited. As such, automatic identification of ROIs in a digital image presents a technological problem to be overcome before further processing can be performed on the data which form the digital image.

ROI identification and ROI mask generation, or image segmentation, may be implemented in various applications, including but not limited to face identification and recognition, object identification and recognition, satellite map processing, and general computer vision and image processing. For example, ROI identification and segmentation may be implemented in medical image processing. Such medical images may include but are not limited to Computed Tomography (CT) images, Magnetic Resonance Imaging (MRI) images, ultrasound images, X-Ray images, and the like. In Computer-Aided Diagnosis (CAD), a single image or a group of images may first be analyzed and segmented into ROIs and non-ROIs. One or more ROI masks, alternatively referred to as segmentation masks, may be generated. An ROI in a medical image may be specified at various levels depending on the applications. For example, an ROI may be an entire organ. As such, a corresponding binary ROI mask may be used to mark the location of the organ tissues and distinguish them from the regions outside of the ROI that are not part of the organ. For another example, an ROI may represent a lesion in an organ, or tissue of one or more particular types in the organ. These different levels of ROIs may be hierarchical. For example, a lesion may be part of an organ.

The present disclosure may be particularly applied to different types of images obtained by imaging various types of human tissues or organs to perform ROI identification, ROI mask generation, and image segmentation, including but not limited to, for example, brain segmentation, pancreas segmentation, lung segmentation, or prostate segmentation.

In one embodiment, MR images of the prostate from one or more patients may be processed using computer aided diagnosis (CAD). Prostate segmentation for marking the boundary of the prostate organ is usually the first step in prostate MR image processing and plays an important role in computer aided diagnosis of prostate diseases. One key to prostate segmentation is to accurately determine the boundary of the prostate tissues, whether normal or pathological. Because images of normal prostate tissues may vary in texture, and an abnormal prostate tissue may additionally contain patches of distinct or varying texture and patterns, identification of prostate tissues using computer vision may be particularly challenging. Misidentifying a pathological portion of the prostate tissue as not being part of the prostate and masking it out from subsequent CAD analysis may lead to unacceptable false diagnostic negatives. Accordingly, the need to accurately and reliably identify an ROI in a digital image such as a medical image of a prostate or other organs is critical to proper medical diagnosis.

Segmentation of images may be performed by a computer using a model developed using deep neural network-based machine learning algorithms. For example, a segmentation model may be based on a Fully Convolutional Network (FCN) or Deep Convolutional Neural Networks (DCNN). Model parameters may be trained using labeled images. The image labels in this case may contain ground truth masks, which may be produced by human experts or via other independent processes. The FCN may contain only convolution layers. In an exemplary implementation, digital images of lungs may be processed by such neural networks for lung segmentation and computer aided diagnosis. During the training process, the model learns various features and patterns of lung tissues using the labeled images. These features and patterns include both global and local features and patterns, as represented by various convolution layers of the FCN. The knowledge obtained during the training process is embodied in the set of model parameters representing the trained FCN model. As such, a trained FCN model may process an input digital image with an unknown mask and output a predicted segmentation mask. It is critical to architect the FCN to facilitate efficient and accurate learning of image features of relevance to a particular type of images.

A System to Perform CAD Using FCN

As FIG. 1 shows, a system 100 for CAD using a fully convolutional network (FCN) may include two stages: tissue segmentation and lesion detection. A first stage 120 performs tissue (or organ) segmentation from images 110 to generate a tissue segmentation mask 140. A second stage 160 performs lesion detection from the tissue segmentation mask 140 and the images 110 to generate the lesion mask 180. In some implementations, the lesion mask may be non-binary. For example, within the organ region of the image 110 as represented by the segmentation mask 140, there may be multiple lesion regions of different pathological types. The lesion mask 180 thus may correspondingly be non-binary as discussed above, and contains both spatial information (as to where the lesions are) and information as to the type of each lesion. According to the lesion mask 180, the system may generate a diagnosis with a certain probability for a certain disease. A lesion region may be a portion of the organ containing cancer, a tumor, or the like. In some embodiments, the system may only include the stage of tissue segmentation 120. In other embodiments, the system may only include the stage of lesion detection 160. In each of the first stage 120 and the second stage 160, a slight variation of an FCN may be used. The first stage 120 and the second stage 160 may use the same type of FCN, or may use different types of FCN. The FCNs for stages 120 and 160 may be separately or jointly trained.
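
One way to picture this two-stage data flow is the sketch below (illustrative only; the function names, the particular FCN wrappers, and the channel-wise concatenation of the mask with the images are assumptions, not the specific mechanism of FIG. 1):

    import torch

    def run_cad_pipeline(images, segmentation_fcn, lesion_fcn):
        """Sketch of a two-stage CAD flow: segmentation, then lesion detection.

        images: tensor of input images (corresponding to 110).
        segmentation_fcn: stage-I FCN producing a tissue segmentation mask (140).
        lesion_fcn: stage-II FCN producing a lesion mask (180) from the
                    segmentation mask together with the original images.
        """
        tissue_mask = segmentation_fcn(images)  # stage 120 -> segmentation mask 140
        # Assumed combination: stack mask and images along the channel dimension.
        lesion_mask = lesion_fcn(torch.cat([images, tissue_mask], dim=1))  # stage 160 -> lesion mask 180
        return tissue_mask, lesion_mask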

In one exemplary implementation, the tissue or organ may be a prostate, and CAD of the prostate may be implemented in two steps. The first step may include determining a segmentation boundary of prostate tissue by processing input MR prostate images, producing an ROI mask for the prostate; and the second step includes detecting a diseased portion of the prostate tissue, e.g., a prostate tumor or cancer, by processing the ROI-masked MR images of the prostate tissue. In another embodiment, MR images of the prostate from one or more patients may be used in computer aided diagnosis (CAD). For each patient, volumetric prostate MR images are acquired with multiple channels from multi-modalities including, e.g., T2 W, DWI, ADC and K-trans, where ADC maps may be calculated from DWI, and K-trans images may be obtained using dynamic contrast enhanced (DCE) MR perfusion.

The FCN may be adapted to the form of the input images (either two-dimensional images or three-dimensional images, either multichannel or single-channel images). For example, the FCN may be adapted to include features of an appropriate number of dimensions for processing single-channel 2D or 3D images or multichannel 2D or 3D images.

In one embodiment, for example, the images 110 may be one or more 2D images each with a single channel (SC). The first stage 120 may use a two-dimensional single-channel FCN (2D, SC) 122 to perform tissue segmentation to obtain the segmentation mask 140 of an input image 110. The segmentation mask 140 may be a single 2D mask (or a mask with a single channel). The second stage 160 may use a two-dimensional single-channel FCN (2D, SC) 162 to perform lesion detection. The FCN (2D, SC) 162 operates on the single-channel segmentation mask 140 and the single-channel image 110 (via arrow 190) to generate the lesion mask 180. Accordingly, the lesion mask 180 may be a single-channel 2D mask.

In a second embodiment, the images 110 may be one or more 2D images each with multiple channels (MC). The multiple channels may be different chromatic channels (e.g., red, blue, and green colors) or other different imaging modality channels (e.g., T2 W, DWI, ADC and K-trans for MR imaging). The first stage 120 may use the two-dimensional single-channel FCN (2D, SC) 122 to perform tissue segmentation of multiple channels of an input image 110 to obtain a multi-channel segmentation mask 140. In particular, since the two-dimensional single-channel FCN (2D, SC) 122 is used in the tissue segmentation, each channel of the multi-channel image 110 may be regarded as independent from other channels, and thus each channel may be processed individually. Therefore, the segmentation mask 140 may be a 2D mask with multiple channels. The second stage 160 may use a two-dimensional multi-channel FCN (2D, MC) 164 to perform lesion detection. The FCN (2D, MC) 164 may operate simultaneously on the multi-channel segmentation mask 140 and the multi-channel image 110 to generate the lesion mask 180. The two-dimensional multi-channel FCN (2D, MC) 164 may process the multi-channel mask 140 and the multi-channel image 110 in a combined manner to generate a single-channel lesion mask 180.

In a third embodiment, the images 110 may be one or more 2D images with multiple channels. The first stage 120 may use a two-dimensional multi-channel FCN (2D, MC) 124 to perform tissue segmentation of a multi-channel image 110 in a combined manner to obtain a single-channel segmentation mask 140. The second stage 160 may use a two-dimensional single-channel FCN (2D, SC) 162 to perform lesion detection in the single-channel mask 140. In particular, the two-dimensional single-channel FCN (2D, SC) 162 operates on the segmentation mask 140 and the multi-channel image 110 to generate a multi-channel lesion mask 180.

In a fourth embodiment, the images 110 may be one or more 2D images with multiple channels. The first stage 120 may use a two-dimensional multi-channel FCN (2D, MC) 124 to perform tissue segmentation on the multi-channel image 110 to obtain a single-channel segmentation mask 140. The second stage 160 may use a two-dimensional multi-channel FCN (2D, MC) 164 to perform lesion detection. The two-dimensional multi-channel FCN (2D, MC) 164 operates on the single-channel segmentation mask 140 and the multi-channel images 110 to generate a single-channel lesion mask 180.

In a fifth embodiment, the images 110 may be one or more 3D images with a single channel. The first stage 120 may use a three-dimensional single-channel FCN (3D, SC) 126 to perform tissue segmentation of a single-channel image 110 to obtain a three-dimensional single-channel segmentation mask 140. The second stage 160 may use a three-dimensional single-channel FCN (3D, SC) 166 to perform lesion detection. The FCN (3D, SC) 166 operates on the single-channel segmentation mask 140 and the single-channel image 110 to generate a three-dimensional single-channel lesion mask 180.

In a sixth embodiment, the images 110 may be one or more 3D images with multiple channels. The multiple channels may be different chromatic channels (e.g., red, blue, and green colors) or other different modality channels (e.g., T2 W, DWI, ADC and K-trans for MR imaging). The first stage 120 may use a three-dimensional single-channel FCN (3D, SC) 126 to perform tissue segmentation to obtain a three-dimensional multi-channel segmentation mask 140. In particular, each channel may be regarded as independent from other channels, and thus each channel is processed individually by the FCN (3D, SC) 126. The second stage 160 may use a three-dimensional single-channel FCN (3D, SC) 166 to perform lesion detection. The FCN (3D, SC) 166 operates on the multi-channel segmentation mask 140 and the multi-channel images 110 to generate the lesion mask 180. Since the single-channel FCN (3D, SC) 166 is used in the lesion detection, each channel of the multiple channels may be processed independently. Therefore, the lesion mask 180 may be a 3D multi-channel mask.

In a seventh embodiment, the images 110 may be one or more 3D images with multiple channels. The first stage 120 may use a three-dimensional single-channel FCN (3D, SC) 126 to perform tissue segmentation to obtain the segmentation mask 140. Since the FCN (3D, SC) 126 is used in the tissue segmentation, each channel may be regarded as independent from other channels, and thus each channel may be processed individually. Therefore, the segmentation mask 140 may be a three-dimensional multi-channel mask. The second stage 160 may use a three-dimensional multi-channel FCN (3D, MC) 168 to perform lesion detection. The FCN (3D, MC) 168 operates on the multi-channel segmentation mask 140 and the multi-channel image 110 to generate the lesion mask 180. Since the multi-channel FCN (3D, MC) 168 is used in the lesion detection, multiple channels may be processed in a combined/aggregated manner. Therefore, the lesion mask 180 may be a three-dimensional single-channel mask.

In an eighth embodiment, the images 110 may be one or more 3D images with multiple channels. The first stage 120 may use a three-dimensional multi-channel FCN (3D, MC) 128 to perform tissue segmentation of the 3D multi-channel image 110 in an aggregated manner to obtain a three-dimensional single-channel segmentation mask 140. The second stage 160 may use a three-dimensional single-channel FCN (3D, SC) 166 to perform lesion detection. The FCN (3D, SC) 166 operates on the single-channel segmentation mask 140 and the multi-channel images 110 to generate a three-dimensional multi-channel lesion mask 180.

In a ninth embodiment, the images 110 may be one or more 3D images with multiple channels. The first stage 120 may use a three-dimensional multi-channel FCN (3D, MC) 128 to perform tissue segmentation of the multi-channel image 110 in an aggregated manner to obtain a three-dimensional single-channel segmentation mask 140. The second stage 160 may use a three-dimensional multi-channel FCN (3D, MC) 168 to perform lesion detection. The FCN (3D, MC) 168 operates on the single-channel segmentation mask 140 and the multi-channel image 110 to generate a three-dimensional single-channel lesion mask 180 in an aggregated manner.

Optionally, when the total z-axis depth of a 3D image 110 is relatively shallow, a 2D mask may be sufficient to serve as the mask for the 3D image having different values along the z-axis. In one exemplary implementation, for a 3D image 110 with a shallow total z-axis range, the first stage 120 may use a modified FCN (3D, SC) to perform tissue segmentation to obtain the segmentation mask 140, where the segmentation mask 140 may be a 2D mask.

The various two- or three-dimensional and single- or multi-channel FCN models discussed above are further elaborated below.

Fully Convolutional Network (FCN)

Segmentation of images and/or target object (e.g., lesion) detection may be performed by a computer using a model developed using deep neural network-based machine learning algorithms. For example, a model may be based on a Fully Convolutional Network (FCN), Deep Convolutional Neural Networks (DCNN), U-net, or V-Net. Hereinafter, the term “FCN” is used to generally refer to any CNN-based model.

Model parameters may be trained using labeled images. The image labels in this case may contain ground truth masks, which may be produced by human experts or via other independent processes. The FCN may contain multiple convolution layers. An exemplary embodiment involves processing digital images of lung tissues for computer aided diagnosis. During the training process, the model learns various features and patterns of, e.g., lung tissues or prostate tissues, using the labeled images. These features and patterns include both global and local features and patterns, as represented by various convolution layers of the FCN. The knowledge obtained during the training process is embodied in the set of model parameters representing the trained FCN model. As such, a trained FCN model may process an input digital image with an unknown mask and output a predicted segmentation mask.

The present disclosure describes an FCN with different variations. The FCN is capable of performing tissue segmentation or lesion detection on 2D or 3D images with single or multiple channels. 3D images are used as an example in the embodiments below. 2D images can be processed similarly, as 2D images may be regarded as 3D images with one z-slice, i.e., a single z-axis value.

An exemplary FCN model for predicting a segmentation/lesion mask for an input digital image is shown as FCN model 200 in FIG. 2. The FCN model 200 may be configured to process input 2D/3D images 210. As an example for the illustration below, the input image may be 3D and may be of an exemplary size 128×128×24, i.e., 3D images having a size of 128 along the x-axis, a size of 128 along the y-axis, and a size of 24 along the z-axis.

A down-sampling path 220 may comprise a contraction neural network which gradually reduces the resolution of the 2D/3D images to generate feature maps with smaller resolution and larger depth. For example, the output 251 from the down-sampling path 220 may be feature maps of 16×16×3 with a depth of 128, i.e., the feature maps having a size of 16 along the x-axis, a size of 16 along the y-axis, a size of 3 along the z-axis, and a depth of 128 (exemplary only, and corresponding to the number of features).

The down-sampling path 220 is contracting because it processes an input image into feature maps whose resolutions are progressively reduced through one or more layers of the down-sampling path 220. As such, the term “contraction path” may be alternatively referred to as “down-sampling path”.

The FCN model 200 may include an up-sampling path 260, which may be an expansion path to generate high-resolution feature maps for voxel-level prediction. The output 251 from the down-sampling path 220 may serve as an input to the up-sampling path 260. An output 271 from the up-sampling path 260 may be feature maps of 128×128×24 with a depth of 16, i.e., the feature maps having a size of 128 along the x-axis, a size of 128 along the y-axis, a size of 24 along the z-axis, and a depth of 16.

The up-sampling path 260 processes these feature maps in one or more layers and in an opposite direction to that of the down-sampling path 220 and eventually generates a segmentation mask 290 with a resolution similar or equal to that of the input image. As such, the term “expansion path” may be alternatively referred to as “up-sampling path”.

The FCN model 200 may include a convolution step 280, which is performed on the output of the up-sampling stage with the highest resolution to generate map 281. The convolution operation kernel in step 280 may be 1×1×1 for 3D images or 1×1 for 2D images.

The FCN model 200 may include a rectifier, e.g., a sigmoid step, which takes map 281 as input to generate a voxel-wise binary classification probability prediction mask 290.

Depending on specific applications, the down-sampling path 220 and the up-sampling path 260 may include any number of convolution layers, pooling layers, or de-convolutional layers. For example, for a training set having a relatively small size, there may be 6-50 convolution layers, 2-6 pooling layers, and 2-6 de-convolutional layers. Specifically, as an example, there may be 15 convolution layers, 3 pooling layers, and 3 de-convolutional layers. The size and number of features in each convolutional or de-convolutional layer may be predetermined and are not limited in this disclosure.

The FCN model 200 may optionally include one or more steps 252 connecting the feature maps within the down-sampling path 220 with the feature maps within the up-sampling path 260. During steps 252, the output of the de-convolution layers may be fed to the corresponding feature maps generated in the down-sampling path 220 with matching resolution (e.g., by concatenation, as shown below in connection with FIG. 4). Steps 252 may provide complementary high-resolution information into the up-sampling path 260 to enhance the final prediction mask, since the de-convolution layers only take coarse features from low-resolution layers as input. Feature maps at different convolution layers and with different levels of resolution from the contraction CNN may be input into the expansion CNN, as shown by the arrow 252.
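
For orientation only, a heavily simplified contraction/expansion network with one skip connection of the kind described above might be sketched as follows in PyTorch. The layer counts, feature depths, and module names are assumptions for illustration and do not reproduce the specific architecture of FIG. 2:

    import torch
    import torch.nn as nn

    class TinyFCN(nn.Module):
        """Minimal contraction/expansion FCN sketch with one skip connection."""
        def __init__(self, in_channels=1, base_features=16):
            super().__init__()
            # Contraction (down-sampling) path.
            self.down_conv = nn.Sequential(
                nn.Conv3d(in_channels, base_features, kernel_size=3, padding=1),
                nn.BatchNorm3d(base_features),
                nn.ReLU(inplace=True))
            self.pool = nn.MaxPool3d(kernel_size=2)
            self.bottom_conv = nn.Sequential(
                nn.Conv3d(base_features, base_features * 2, kernel_size=3, padding=1),
                nn.BatchNorm3d(base_features * 2),
                nn.ReLU(inplace=True))
            # Expansion (up-sampling) path.
            self.up = nn.ConvTranspose3d(base_features * 2, base_features,
                                         kernel_size=2, stride=2)
            self.up_conv = nn.Sequential(
                nn.Conv3d(base_features * 2, base_features, kernel_size=3, padding=1),
                nn.BatchNorm3d(base_features),
                nn.ReLU(inplace=True))
            # 1x1x1 convolution followed by a sigmoid for the voxel-wise prediction mask.
            self.head = nn.Conv3d(base_features, 1, kernel_size=1)

        def forward(self, x):
            skip = self.down_conv(x)                    # high-resolution features
            bottom = self.bottom_conv(self.pool(skip))  # low-resolution features
            up = self.up(bottom)                        # back to the skip's resolution
            merged = torch.cat([skip, up], dim=1)       # skip connection by concatenation
            return torch.sigmoid(self.head(self.up_conv(merged)))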

The model parameters for the FCN include the features or kernels used in various convolutional layers for generating the feature maps, their weights and biases, and other parameters. By training the FCN model using images labeled with ground truth segmentation masks, a set of features and other parameters may be learned such that patterns and textures in an input image with an unknown label may be identified. In many circumstances, such as medical image segmentation, this learning process may be challenging due to a lack of a large number of samples of certain important textures or patterns in training images relative to other textures or patterns. For example, in a lung image, disease image patches are usually much fewer than other normal image patches, and yet it is extremely critical that these disease image patches are correctly identified by the FCN model and segmented as part of the lung. The large number of parameters in a typical multilayer FCN tends to over-fit the network even after data augmentation. As such, the model parameters and the training process are preferably designed to reduce overfitting and promote identification of features that are critical but scarce in the labeled training images.

Once the FCN 200 described above is trained, an input image may then be processed through the down-sampling/contraction path 220 and the up-sampling/expansion path 260 to generate a predicted segmentation mask. The predicted segmentation mask may be used as a filter for subsequent processing of the input image.

The training process for the FCN 200 may involve forward-propagating each of a set of training images through the down-sampling/contraction path 220 and the up-sampling/expansion path 260. The set of training images are each associated with a label, e.g., a ground truth mask 291. The training parameters, such as all the convolutional features or kernels, various weights, and biases, may be, e.g., randomly initialized. The output segmentation mask resulting from the forward-propagation may be compared with the ground truth mask 291 of the input image. A loss function 295 may be determined. In one implementation, the loss function 295 may include a softmax cross-entropy loss. In another implementation, the loss function 295 may include a dice coefficient (DC) loss function. Other types of loss functions are also contemplated. Then a back-propagation through the expansion path 260 and then the contraction path 220 may be performed based on, e.g., stochastic gradient descent, and aimed at minimizing the loss function 295. By iterating the forward-propagation and back-propagation for the same input images, and for the entire training image set, the training parameters may converge to provide acceptable errors between the predicted masks and ground truth masks for all or most of the input images. The converged training parameters, including but not limited to the convolutional features/kernels and various weights and biases, may form a final predictive model that may be further verified using test images and used to predict segmentation masks for images that the network has never seen before. The model is preferably trained to promote errors on the over-inclusive side to reduce or prevent false negatives in later stages of CAD based on a predicted mask.
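
A bare-bones version of this forward-propagate / compute-loss / back-propagate cycle, using stochastic gradient descent and a dice loss as one of the options mentioned above, could be sketched as follows (illustrative only; the model, data loader, and hyperparameters are assumed):

    import torch

    def dice_loss(pred, target, eps=1e-6):
        # Dice coefficient (DC) loss over a batch of probability masks.
        intersection = (pred * target).sum()
        return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

    def train(model, data_loader, epochs=10, lr=1e-3):
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            for image, ground_truth_mask in data_loader:
                optimizer.zero_grad()
                predicted_mask = model(image)                         # forward-propagation
                loss = dice_loss(predicted_mask, ground_truth_mask)   # end loss vs. ground truth
                loss.backward()                                       # back-propagation
                optimizer.step()                                      # gradient descent update
        return model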

FIG. 3 describes one specific example of an FCN. The working of the exemplary implementation of FIG. 3 is described below. A more detailed description is included in U.S. application Ser. No. 15/943,392, filed on Apr. 2, 2018 by the same Applicant as the present application, which is incorporated herein by reference in its entirety.

As shown in step 310 of FIG. 3, 2D/3D images may be fed into the FCN. The 2D/3D images may be MR image patches with a size of 128×128×24.

The 2D/3D images 311 may be fed into step 312, which includes one or more convolutional layers, for example and not limited to, two convolutional layers. In each convolutional layer, a kernel size may be adopted, and the number of filters may increase or stay the same. For example, a kernel size of 3×3×3 voxels may be used in the convolution sub-step in step 312. Each convolutional sub-step is followed by a batch normalization (BN) and rectified-linear unit (ReLU) sub-step.

The number of features at each convolution layer may be predetermined. For example, in the particular implementation of FIG. 3, the number of features for step 312 may be 16. The output of the convolution between the input and the features may be processed by a ReLU to generate stacks of feature maps. The ReLUs may be of any mathematical form. For example, the ReLUs may include but are not limited to noisy ReLUs, leaky ReLUs, and exponential linear units.

After each convolution of an input with a feature in each layer, the number of pixels in a resulting feature map may be reduced. For example, for a 100-pixel by 100-pixel input, after convolution with a 5-pixel by 5-pixel feature sliding through the input with a stride of 1, the resulting feature map may be 96 pixels by 96 pixels. Further, the input image or a feature map may be, for example, zero padded around the edges to 104 pixels by 104 pixels such that the output feature map has the 100-pixel by 100-pixel resolution. The example below uses the zero padding method, so that the resolution does not change before and after the convolution function.
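
The output-size arithmetic behind that example can be written out explicitly (a small helper assuming the usual formula for convolution output size; the numbers reproduce the 100/96/104 figures above):

    def conv_output_size(input_size, kernel_size, stride=1, padding=0):
        # Standard formula for the spatial size after a convolution.
        return (input_size + 2 * padding - kernel_size) // stride + 1

    print(conv_output_size(100, 5))              # 96: without padding the map shrinks
    print(conv_output_size(100, 5, padding=2))   # 100: zero padding (100 padded to 104) keeps the size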

The feature maps from step 312 may be max-pooled spatially to generate the input to the next layer below. The max-pooling may be performed using any suitable basis, e.g., 2×2, resulting in down-sampling of an input by a factor of 2 in each of the two spatial dimensions. In some implementations, spatial pooling other than max-pooling may be used. Further, different layers 314, 324, and 334 may be pooled using a same basis or different bases, and using a same or different pooling method.

For example, after max-pooling by a factor of 2, the feature maps 321 have a size of 64×64×12. Since the number of features used in step 312 is 16, the depth of the feature maps 321 may be 16.
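
As a quick shape check (illustration only; a 2×2×2 pooling basis is assumed here so that the z-dimension is also halved, consistent with 128×128×24 becoming 64×64×12):

    import torch
    import torch.nn as nn

    feature_maps = torch.zeros(1, 16, 128, 128, 24)    # (batch, depth/features, x, y, z)
    pooled = nn.MaxPool3d(kernel_size=2)(feature_maps)
    print(pooled.shape)  # torch.Size([1, 16, 64, 64, 12])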

The feature maps 321 may be fed into the next convolution layer 322 with another kernel and number of features. For example, the number of features in step 322 may be 32.

In the example shown in FIG. 3, the output from step 324 may be feature maps 331 having a size of 32×32×6. Since the number of features used in step 322 is 32, the depth of the feature maps 331 may be 32.

The feature maps 331 may be fed into the next convolution layer 332 with another kernel and number of features. For example, the number of features in step 332 may be 64.

In the example shown in FIG. 3, the output from step 334 may be feature maps 341 having a size of 16×16×3. Since the number of features used in step 332 is 64, the depth of the feature maps 341 may be 64.

The feature maps 341 may be fed into the next convolution layer 342 with another kernel and number of features. For example, the number of features in step 342 may be 128.

In the example shown in FIG. 3, the output from step 342 may be feature maps 351 having a size of 16×16×3. There is no size reduction between feature maps 341 and feature maps 351 because there is no max-pooling step in between. Since the number of features used in step 342 is 128, the depth of the feature maps 351 may be 128.

The feature maps 351 of the final layer 342 of the down-sampling path may be input into an up-sampling path and processed upward through the expansion path. The expansion path, for example, may include one or more de-convolution layers, corresponding to convolution layers on the contraction path, respectively. The up-sampling path, for example, may involve increasing the number of pixels for feature maps in each spatial dimension, by a factor of, e.g., 2, but reducing the number of feature maps, i.e., reducing the depth of the feature maps. This reduction may be by a factor of, e.g., 2, or this reduction may not be by an integer factor; for example, the previous layer has a depth of 128, and the reduced depth is 64.

The feature maps 351 may be fed into a de-convolution operation 364 with 2×2×2 trainable kernels to increase/expand the size of the input feature maps by a factor of 2. The output of the de-convolution operation 364 may be feature maps 361 having a size of 32×32×6. Since the number of features used in the de-convolution operation 364 is 64, the depth of the feature maps 361 is 64.

The feature maps 361 may be fed into a step 362. The step 362 may include one or more convolution layers. In the example shown in FIG. 3, the step 362 includes two convolution layers. Each of the convolution layers includes a convolution function sub-step followed by a batch normalization (BN) and rectified-linear unit (ReLU) sub-step.

Optionally, as connections 352 a, 352 b, and 352 c in FIG. 3 show, at each expansion layer 362, 372, and 382, the feature maps from the down-sampling path may be concatenated with the feature maps in the up-sampling path, respectively. Taking 352 c as an example, the feature maps in step 332 of the down-sampling path have a size of 32×32×6 and a depth of 64. The feature maps 361 of the up-sampling path have a size of 32×32×6 and a depth of 64. These two feature maps may be concatenated together to form new feature maps having a size of 32×32×6 and a depth of 128. The new feature maps may be fed as the input into the step 362. The connection 352 c may provide complementary high-resolution information, since the de-convolution layers only take coarse features from low-resolution layers as input.
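
The depth bookkeeping of such a concatenation can be verified directly (illustration only; a (batch, depth, x, y, z) tensor layout is assumed):

    import torch

    down_path_features = torch.zeros(1, 64, 32, 32, 6)   # from step 332, depth 64
    up_path_features = torch.zeros(1, 64, 32, 32, 6)     # feature maps 361, depth 64
    concatenated = torch.cat([down_path_features, up_path_features], dim=1)
    print(concatenated.shape)  # torch.Size([1, 128, 32, 32, 6]): depths add, spatial size unchanged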

The output feature maps of step 362 may be fed into de-convolution layer 374. The de-convolution layer may have a trainable 2×2×2 kernel to increase/expand the size of the input feature maps by a factor of 2. The output of the de-convolution operation 374 may be feature maps 371 having a size of 64×64×12. Since the number of features used in the de-convolution operation 374 is 32, the depth of the feature maps 371 is 32.

Optionally, as the connection 352 b in FIG. 3 shows, the feature maps from step 322 in the down-sampling path may be concatenated with the feature maps 371. The feature maps in step 322 of the down-sampling path have a size of 64×64×12 and a depth of 32. The feature maps 371 of the up-sampling path have a size of 64×64×12 and a depth of 32. These two feature maps may be concatenated together to form new feature maps having a size of 64×64×12 and a depth of 64. The new feature maps may be fed as the input into the step 372.

The output feature maps of step 372 may be fed into de-convolution layer 384. The de-convolution layer may have a trainable 2×2×2 kernel to increase/expand the size of the input feature maps by a factor of 2. The output of the de-convolution operation 384 may be feature maps 381 having a size of 128×128×24. Since the number of features used in the de-convolution operation 384 is 16, the depth of the feature maps 381 is 16.

Optionally, as the connection 352 a in FIG. 3 shows, the feature maps from step 312 in the down-sampling path may be concatenated with the feature maps 381. The feature maps in step 312 of the down-sampling path have a size of 128×128×24 and a depth of 16. The feature maps 381 of the up-sampling path have a size of 128×128×24 and a depth of 16. These two feature maps may be concatenated together to form new feature maps having a size of 128×128×24 and a depth of 32. The new feature maps may be fed as the input into the step 382.

The output feature maps from step 382 may be fed into a convolution step 390, which is performed on the feature maps with the highest resolution to generate feature maps 391. The feature maps 391 may have a size of 128×128×24 and a depth of one. The convolution operation kernel in step 390 may be 1×1×1 for 3D images or 1×1 for 2D images.

The feature maps 391 may be fed into a sigmoid step 392, which takes the feature maps 391 as input to generate voxel-wise binary classification probabilities, which may be further used to determine the predicted segmentation mask for the tissue or lesion.
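
Turning those voxel-wise probabilities into a binary prediction mask is typically a thresholding step (a sketch only; the 0.5 threshold is an assumption, not a value given in this disclosure):

    import torch

    logits = torch.randn(1, 1, 128, 128, 24)        # feature maps 391: depth of one
    probabilities = torch.sigmoid(logits)           # voxel-wise probabilities in [0, 1]
    predicted_mask = (probabilities > 0.5).float()  # binary segmentation mask
    print(predicted_mask.shape)                     # torch.Size([1, 1, 128, 128, 24])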

FIG. 4 describes another embodiment of an FCN for segmentation of 2D/3D images with either single or multiple channels. In the example described below, 3D images with multiple channels may be taken as an example. However, the FCN model in this disclosure is not limited to 3D images with multiple channels (FCN (3D, MC)). For example, the FCN model in this embodiment may be applied to 2D images with multiple channels (FCN (2D, MC)), 3D images with a single channel (FCN (3D, SC)), or 2D images with a single channel (FCN (2D, SC)), as shown in FIG. 1.

In FIG. 4, the input 2D/3D images with multiple channels 401 may include 3D images having a size of 128×128×24 with three channels. The three channels may be multi-parametric modalities in MR imaging including T2W, T1, and DWI with the highest b-value. Or, in some other situations, the three channels may include red, green, and blue chromatic channels. In one embodiment, each of the three channels may be processed independently. In another embodiment, the three channels may be processed collectively, and as such, the first convolutional layer operating on the input images 401 has 3×3×3×3 convolutional kernels.

The FCN model in FIG. 4 may include a down-sampling/contracting path and a corresponding up-sampling/expansion path. The down-sampling/contracting path may include one or more convolutional layers 412.

The convolutional layer 412 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layer 412 may be 32, so that the feature maps of the convolution layer 412 may have a size of 128×128×24 and a depth of 32.

The FCN model in FIG. 4 may include convolution layers 414 with strides greater than 1 for gradually reducing the resolution of feature maps and increasing the receptive field to incorporate more spatial information. For example, the convolution layer 414 may have a stride of 2 along the x-axis, y-axis, and z-axis, so that the resolutions along the x-axis, y-axis, and z-axis are all reduced by a factor of 2. The number of features in the convolution layer 414 may be 64, so that the output feature maps 421 of the convolution layer 414 may have a size of 64×64×12 and a depth of 64.
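
Down-sampling by a strided convolution rather than by pooling can be shape-checked as follows (illustration only; a padding of 1 with a 3×3×3 kernel is assumed so that the stride alone sets the reduction factor):

    import torch
    import torch.nn as nn

    x = torch.zeros(1, 32, 128, 128, 24)  # feature maps from layer 412 (depth 32)
    strided_conv = nn.Conv3d(in_channels=32, out_channels=64,
                             kernel_size=3, stride=2, padding=1)
    print(strided_conv(x).shape)  # torch.Size([1, 64, 64, 64, 12]): resolution halved, depth 64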

The feature maps may be fed into one or more convolution layers 422. Each of the convolutional layers 422 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layer 422 may be 64, so that the output feature maps 423 of the convolution layer 422 may have a size of 64×64×12 and a depth of 64.

At each resolution level, input feature maps may be directly added to output feature maps, which enables the stacked convolutional layers to learn a residual function. As described by the operator ⊕ in FIG. 4, the feature maps 421 and 423 may be added together. The output feature maps 425 of the operator ⊕ may be a summation of feature maps 421 and 423 element by element. The feature maps 425 may have a size of 64×64×12 and a depth of 64.
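
The residual (identity-plus-convolution) pattern can be written compactly as follows (a minimal sketch; the two-layer block is assumed, loosely matching the stacked convolutional layers described above):

    import torch
    import torch.nn as nn

    class ResidualBlock3d(nn.Module):
        """Stacked convolutions whose output is added element-wise to the input."""
        def __init__(self, channels):
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True),
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True))

        def forward(self, x):
            # The block learns a residual function F(x); the output is x + F(x).
            return x + self.convs(x)

    block = ResidualBlock3d(64)
    print(block(torch.zeros(1, 64, 64, 64, 12)).shape)  # size and depth are unchanged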

The feature maps 425 may be fed into a convolution layer 424 with a stride of 2 along the x-axis, y-axis, and z-axis, so that the resolution along the x-axis, y-axis, and z-axis is reduced by a factor of 2. The number of features in the convolution layer 424 may be 128, so that the output feature maps 431 of the convolution layer 424 may have a size of 32×32×6 and a depth of 128.

The feature maps 431 may be fed into one or more convolution layers 432. Each of the convolutional layers 432 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layer 432 may be 128, so that the output feature maps 433 of the convolution layers 432 may have a size of 32×32×6 and a depth of 128.

At each resolution level, input feature maps may be directly added to output feature maps, which enables the stacked convolutional layers to learn a residual function. As described by the operator ⊕ in FIG. 4, the feature maps 431 and 433 may be added together. The output feature maps 435 of the operator ⊕ may be an element-by-element summation of feature maps 431 and 433. The feature maps 435 may have a size of 32×32×6 and a depth of 128.

The feature maps 435 may be fed into a convolution layer 434 with a stride of 2 along the x-axis, y-axis, and z-axis, so that the resolution along the x-axis, y-axis, and z-axis is reduced by a factor of 2. The number of features in the convolution layer 434 may be 256, so that the output feature maps 441 of the convolution layer 434 may have a size of 16×16×3 and a depth of 256.

The feature maps 441 may be fed into one or more convolution layers 442. Each of the convolutional layers 442 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layer 442 may be 256, so that the output feature maps 443 of the convolution layers 442 may have a size of 16×16×3 and a depth of 256.

Again, at each resolution level, input feature maps may be directly added to output feature maps, which enables the stacked convolutional layers to learn a residual function. As described by the operator ⊕ in FIG. 4, the feature maps 441 and 443 may be added together. The output feature maps 445 of the operator ⊕ may be an element-by-element summation of feature maps 441 and 443. The feature maps 445 may have a size of 16×16×3 and a depth of 256.

Optionally, the feature maps 445 may be fed into a convolution layer 444 with a stride of 2 along the x-axis and y-axis, and with a stride of 1 along the z-axis, so that the resolution along the x-axis and y-axis is reduced by a factor of 2 and the resolution along the z-axis remains the same. The number of features in the convolution layer 444 may be 512, so that the output feature maps 451 of the convolution layer 444 may have a size of 8×8×3 and a depth of 512.

The feature maps 451 may be fed into one or more convolution layers 452. Each of the convolutional layers 452 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layer 452 may be 512, so that the output feature maps 453 of the convolution layers 452 may have a size of 8×8×3 and a depth of 512.

At each resolution level, input feature maps may be directly added to output feature maps, which enables the stacked convolutional layers to learn a residual function. As described by the operator ⊕ in FIG. 4, the feature maps 451 and 453 may be added together. The output feature maps 455 of the operator ⊕ may be an element-by-element summation of feature maps 451 and 453. The feature maps 455 may have a size of 8×8×3 and a depth of 512.

The FCN model in FIG. 4 may include a corresponding up-sampling/expansion path to increase the resolution of feature maps generated from the down-sampling/contracting path. Optionally, in the up-sampling/expansion path, the feature maps generated in the contracting path may be concatenated with the output of the de-convolutional layers to incorporate high-resolution information.

The up-sampling/expansion path may include a de-convolution layer 464. The de-convolution layer 464 may de-convolute the input feature maps 455 to increase their resolution by a factor of 2 along the x-axis and y-axis. The output feature maps 461 of the de-convolution layer 464 may have a size of 16×16×3. The number of features in the de-convolution layer 464 may be 256, so that the output feature maps 461 of the de-convolution layer 464 may have a depth of 256.
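
Because this particular de-convolution up-samples only along x and y (8×8×3 to 16×16×3), an anisotropic stride is needed; a shape check might look like this (illustration only; a (2, 2, 1) stride and 2×2×1 kernel are assumed for a (batch, depth, x, y, z) layout):

    import torch
    import torch.nn as nn

    feature_maps_455 = torch.zeros(1, 512, 8, 8, 3)
    deconv = nn.ConvTranspose3d(in_channels=512, out_channels=256,
                                kernel_size=(2, 2, 1), stride=(2, 2, 1))
    print(deconv(feature_maps_455).shape)  # torch.Size([1, 256, 16, 16, 3])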

As denoted by the operator “©” in FIG. 4, the feature maps 443 generated from the convolution layer 442 may be concatenated with the feature maps 461 generated from the de-convolution layer 464. The feature maps 443 may have a size of 16×16×3 and a depth of 256. The feature maps 461 may have a size of 16×16×3 and a depth of 256. The concatenated feature maps 463 may correspondingly have a size of 16×16×3 and a depth of 512.

The feature maps 463 may be fed into one or more convolution layers 462. Each of the convolutional layers 462 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layer 462 may be 256, so that the output feature maps 465 of the convolution layers 462 may have a size of 16×16×3 and a depth of 256.

As described by the operator ⊕ 466 in FIG. 4, the feature maps 461 and 465 may be added together. The output feature maps 467 of the operator ⊕ 466 may be an element-by-element summation of feature maps 461 and 465. The feature maps 467 may have a size of 16×16×3 and a depth of 256.

The feature maps 467 may be fed into a de-convolution layer 474. The de-convolution layer 474 may de-convolute the input feature maps 467 to increase their resolution by a factor of 2 along the x-axis, y-axis, and z-axis. The output feature maps 471 of the de-convolution layer 474 may have a size of 32×32×6. The number of features in the de-convolution layer 474 may be 128, so that the output feature maps 471 of the de-convolution layer 474 may have a depth of 128.

As denoted by operator “©” in FIG. 4, the feature maps 433 generated from the convolution layer 432 may be concatenated with the feature maps 471 generated from the de-convolution layer 474. The feature maps 433 may have a size of 32×32×6 and a depth of 128. The feature maps 471 may have a size of 32×32×6 and a depth of 128. The concatenated feature maps 473 may correspondingly have a size of 32×32×6 and a depth of 256.

The feature maps 473 may be fed into one or more convolution layers 472. Each of the convolutional layers 472 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layers 472 may be 128, so that the output feature maps 475 of the convolution layers 472 may have a size of 32×32×6 and a depth of 128.

As described by operator ⊕ 476 in FIG. 4, the feature maps 471 and 475 may be added together. The output feature maps 477 of the operator ⊕ 476 may be an element-by-element summation of feature maps 471 and 475. The feature maps 477 may have a size of 32×32×6 and a depth of 128.

The feature maps 477 may be fed into a de-convolution layer 484. The de-convolution layer 484 may de-convolute the input feature maps 477 to increase their resolution by a factor of 2 along x-axis, y-axis, and z-axis. The output feature maps 481 of the de-convolution layer 484 may have a size of 64×64×12. The number of features in the de-convolution layer 484 may be 64, so that the output feature maps 481 of the de-convolution layer 484 may have a depth of 64.

As denoted by operator “©” in FIG. 4, the feature maps 423 generated from the convolution layer 422 may be concatenated with the feature maps 481 generated from the de-convolution layer 484. The feature maps 423 may have a size of 64×64×12 and a depth of 64. The feature maps 481 may have a size of 64×64×12 and a depth of 64. The concatenated feature maps 483 may correspondingly have a size of 64×64×12 and a depth of 128.

The feature maps 483 may be fed into one or more convolution layers 482. Each of the convolutional layers 482 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layers 482 may be 64, so that the output feature maps 485 of the convolution layers 482 may have a size of 64×64×12 and a depth of 64.

As described by operator ⊕ 486 in FIG. 4, the feature maps 481 and 485 may be added together. The output feature maps 487 of ⊕ 486 may be an element-by-element summation of feature maps 481 and 485. The feature maps 487 may have a size of 64×64×12 and a depth of 64.

The feature maps 487 may be fed into a de-convolution layer 494. The de-convolution layer 494 may de-convolute the input feature maps 487 to increase their resolution by a factor of 2 along x-axis, y-axis, and z-axis. The output feature maps 491 of the de-convolution layer 494 may have a size of 128×128×24. The number of features in the de-convolution layer 494 may be 32, so that the output feature maps 491 of the de-convolution layer 494 may have a depth of 32.

As denoted by operator “©” in FIG. 4, the feature maps 413 generated from the convolution layer 412 may be concatenated with the feature maps 491 generated from the de-convolution layer 494. The feature maps 413 may have a size of 128×128×24 and a depth of 32. The feature maps 491 may have a size of 128×128×24 and a depth of 32. The concatenated feature maps 493 may correspondingly have a size of 128×128×24 and a depth of 64.

The feature maps 493 may be fed into one or more convolution layers 492. Each of the convolutional layers 492 may comprise 3×3×3 convolution kernels, batch normalization (BN), and a ReLU activation function to extract high-level features. The number of features in the convolution layers 492 may be 32, so that the output feature maps 495 of the convolution layers 492 may have a size of 128×128×24 and a depth of 32.

As described by operator ⊕ 496 in FIG. 4, the feature maps 491 and 495 may be added together. The output feature maps 497 of ⊕ 496 may be an element-by-element summation of feature maps 491 and 495. The feature maps 497 may have a size of 128×128×24 and a depth of 32.

The feature maps 497 may be fed into a convolutional layer 498 to generate voxel-wise binary classification probabilities. The convolutional layer 498 may include a 1×1×1 kernel and a sigmoid activation function.
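The prediction head can be sketched as below (an illustrative PyTorch assumption, not the disclosed code): a 1×1×1 convolution collapses the 32-channel maps 497 to one channel and a sigmoid converts each voxel into a binary-classification probability.

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Conv3d(in_channels=32, out_channels=1, kernel_size=1),  # layer 498: 1x1x1 kernel
    nn.Sigmoid(),                                              # voxel-wise probabilities
)

feature_maps_497 = torch.randn(1, 32, 24, 128, 128)  # 128x128x24, depth 32
probabilities = head(feature_maps_497)
print(probabilities.shape)                           # torch.Size([1, 1, 24, 128, 128])
```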

During the training phase, a Dice loss may be adopted as the objective function, and the Dice loss may be expressed as

$\text{Dice loss} = \dfrac{2\sum_{i}^{N} g_{i}\, p_{i}}{\sum_{i}^{N} g_{i} + \sum_{i}^{N} p_{i}}$

where $g_{i}$ and $p_{i}$ are the ground truth label and the predicted label, respectively. Post-processing steps may then be applied to refine the initial segmentation generated by the FCN model. Specifically, in some embodiments, a 3D Gaussian filter may be used to smooth the predicted probability maps, and a connected component analysis may be used to remove small isolated components.
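The objective and the post-processing can be sketched as follows; this is an illustrative reading rather than the disclosed code. The epsilon term, the 1 − Dice convention for minimization, and the 0.5 threshold and min_voxels cutoff in postprocess are assumptions added for the sketch.

```python
import torch
import numpy as np
from scipy import ndimage

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice objective: `pred` holds predicted probabilities p_i and `target`
    holds ground truth labels g_i; returning 1 - Dice gives a quantity to minimize."""
    p, g = pred.reshape(-1), target.reshape(-1)
    dice = (2.0 * (g * p).sum()) / (g.sum() + p.sum() + eps)
    return 1.0 - dice

def postprocess(prob_map: np.ndarray, sigma: float = 1.0, min_voxels: int = 100) -> np.ndarray:
    """3D Gaussian smoothing of the probability map, then connected-component
    analysis to remove small isolated components."""
    smoothed = ndimage.gaussian_filter(prob_map, sigma=sigma)
    mask = smoothed > 0.5
    labels, num = ndimage.label(mask)
    for i in range(1, num + 1):
        component = labels == i
        if component.sum() < min_voxels:
            mask[component] = False
    return mask
```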

Fully Convolutional Network (FCN) Enhanced by a Coarse-To-Fine Architecture

The present disclosure describes an FCN with a coarse-to-fine architecture, which takes advantage of residual learning and deep supervision in order to improve segmentation performance and training efficiency.

One embodiment of such an FCN model is shown as 500 in FIG. 5. The FCN model includes a down-sampling/contraction path 504 and an up-sampling/expansion path 508, similar to the model described in FIG. 3. Auxiliary convolutional layers 522, 532, 542, and 552 may be connected to feature maps 521, 531, 541, and 551 with progressive resolution in the expansion path 508, in order to generate single feature maps which are then up-sampled and fed into a sigmoid function to obtain auxiliary predictions 559.

The auxiliary predictions 559 may be used to further determine the mask for the organ or lesion. During the training process using input images for the FCN model in FIG. 5, the output segmentation mask generated from the auxiliary predictions 559 may be compared with the ground truth mask of the input images. A loss function may be determined. In one implementation, the loss function may include a softmax cross-entropy loss. In another implementation, the loss function may include a dice coefficient (DC) loss function.

The input 2D/3D images 501 in FIG. 5 may be, for example and not limited to, 3D images having a size of 128×128×24. In the expansion path 508, the feature maps 521 may have a size of 16×16×3 and a depth of 128; the feature maps 531 may have a size of 32×32×6 and a depth of 64; the feature maps 541 may have a size of 64×64×12 and a depth of 32; and the feature maps 551 may have a size of 128×128×24 and a depth of 16.

The feature maps 521 may be fed into an auxiliary convolutional layer 522. The auxiliary convolutional layer 522 may have a kernel size of 1×1×1 for 3D images or 1×1 for 2D images. The output feature maps 523 may have a size of 16×16×3 and a depth of one. The output feature maps 523 may be fed into a de-convolutional layer 524 to increase the resolution, so as to generate feature maps 527 having a size of 32×32×6.

The feature maps 531 may be fed into an auxiliary convolutional layer 532. The auxiliary convolutional layer 532 may have a kernel size of 1×1×1 for 3D images or 1×1 for 2D images. The output feature maps 533 may have a size of 32×32×6 and a depth of one. The output feature maps 533 may be added onto the feature maps 527 in an element-by-element manner to generate feature maps 535. The feature maps 535 may have a size of 32×32×6 and a depth of one. The feature maps 535 may be fed into a de-convolutional layer 536 to increase the resolution, so as to generate feature maps 537 having a size of 64×64×12.

The feature maps 541 may be fed into an auxiliary convolutional layer 542. The auxiliary convolutional layer 542 may have a kernel size of 1×1×1 for 3D images or 1×1 for 2D images. The output feature maps 543 may have a size of 64×64×12 and a depth of one. The output feature maps 543 may be added onto the feature maps 537 element-by-element to generate feature maps 545. The feature maps 545 may have a size of 64×64×12 and a depth of one. The feature maps 545 may be fed into a de-convolutional layer 546 to increase the resolution, so as to generate feature maps 547 having a size of 128×128×24.

The feature maps 551 may be fed into an auxiliary convolutional layer 552. The auxiliary convolutional layer 552 may have a kernel size of 1×1×1 for 3D images or 1×1 for 2D images. The output feature maps 553 may have a size of 128×128×24 and a depth of one. The output feature maps 553 may be added onto the feature maps 547 in an element-by-element manner to generate feature maps 555. The feature maps 555 may have a size of 128×128×24 and a depth of one.

The feature maps 555 may be fed into a sigmoid function, which takes the feature maps 555 as input to generate auxiliary 4 (559), which includes voxel-wise binary classification probabilities and may be further processed to become the mask for the tissue or lesion.
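The auxiliary path of FIG. 5 can be condensed into the following sketch (hypothetical PyTorch code; the class name CoarseToFineHead and the loop structure are assumptions, while the channel depths 128/64/32/16 and the sizes follow the example above).

```python
import torch
import torch.nn as nn

class CoarseToFineHead(nn.Module):
    """Each expansion-path feature map is squeezed to depth one by a 1x1x1 convolution,
    added to the up-sampled coarser prediction, and de-convolved toward full resolution;
    a sigmoid on the finest map yields auxiliary 4 (559)."""
    def __init__(self, channels=(128, 64, 32, 16)):
        super().__init__()
        self.squeeze = nn.ModuleList([nn.Conv3d(c, 1, kernel_size=1) for c in channels])
        self.up = nn.ModuleList([nn.ConvTranspose3d(1, 1, kernel_size=2, stride=2)
                                 for _ in channels[:-1]])

    def forward(self, maps):                 # maps ordered coarse (521) to fine (551)
        pred = self.squeeze[0](maps[0])      # feature maps 523
        for squeeze, up, feat in zip(self.squeeze[1:], self.up, maps[1:]):
            pred = up(pred) + squeeze(feat)  # e.g., 527 + 533 -> 535
        return torch.sigmoid(pred)           # auxiliary 4 (559)

maps = [torch.randn(1, 128, 3, 16, 16),      # feature maps 521
        torch.randn(1, 64, 6, 32, 32),       # feature maps 531
        torch.randn(1, 32, 12, 64, 64),      # feature maps 541
        torch.randn(1, 16, 24, 128, 128)]    # feature maps 551
print(CoarseToFineHead()(maps).shape)        # torch.Size([1, 1, 24, 128, 128])
```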

Another enhanced embodiment of an FCN model 600 is shown in FIG. 6A. In the FCN model 600, the feature maps with reduced resolutions in the up-sampling/expansion path may be fed into auxiliary convolutional layers, up-sampled to the original resolution, and then fed into a sigmoid function to generate their corresponding auxiliary predictions. The one or more auxiliary predictions may be combined together to generate the mask for the tissue or lesion.

In particular, the feature maps 521 may be fed into an auxiliary convolutional layer 522 to generate feature maps 523 having a size of 16×16×3 and a depth of one. The output feature maps 523 may be fed into one or more de-convolutional layers to generate feature maps 528 and expand the resolution from 16×16×3 to 128×128×24. In the embodiment as shown in FIG. 6A, the one or more de-convolutional layers with respect to feature map 521 (the lowest resolution layer in the expansion network) may include three de-convolutional layers, and each of the three de-convolutional layers may expand the resolution by a factor of 2 in x-axis, y-axis, and z-axis. In another embodiment, the one or more de-convolutional layers with respect to feature map 521 (the lowest resolution layer in the expansion network) may include one de-convolutional layer, which may expand the resolution by a factor of 8 in x-axis, y-axis, and z-axis. As such, the feature maps 528 may recover the full resolution and may then be fed into a sigmoid function to obtain auxiliary 1 (529).

The feature maps 535 at the second lowest resolution layer may be fed into one or more de-convolutional layers to generate feature maps 538 and expand the resolution from 32×32×6 to 128×128×24 (full resolution). In the embodiment as shown in FIG. 6A, the one or more de-convolutional layers here may include two de-convolutional layers, and each of the two de-convolutional layers may expand the resolution by a factor of 2 in x-axis, y-axis, and z-axis. In another embodiment, the one or more de-convolutional layers may include one de-convolutional layer, which may expand the resolution by a factor of 4 in x-axis, y-axis, and z-axis. The feature maps 538 may be fed into a sigmoid function to obtain auxiliary 2 (539).

The feature maps 545 at the second highest resolution layer may be fed into a de-convolutional layer to generate feature maps 548 and expand the resolution from 64×64×12 to 128×128×24 (full resolution). In one embodiment as shown in FIG. 6A, the de-convolutional layer may expand the resolution by a factor of 2 in x-axis, y-axis, and z-axis. The feature maps 548 may be fed into a sigmoid function to obtain auxiliary 3 (549).

The one or more auxiliary predictions 529, 539, 549, and 559 in FIG. 6A may be combined together to generate the mask for the tissue or lesion. The manner of combination will be discussed in more detail below.

For example, one implementation for combining the auxiliary predictions 529, 539, 549, and 559 is shown in FIG. 6B. In particular, the auxiliary predictions 529, 539, 549, and 559 may be concatenated together in step 620 to generate a concatenated auxiliary prediction 621. The concatenated auxiliary prediction 621 may have a size of 128×128×24 and a depth of four. The concatenated auxiliary prediction 621 may be fed into a convolutional layer with a kernel size of 1×1×1 to generate auxiliary prediction 623. The auxiliary prediction 623 may have a size of 128×128×24 and a depth of one. The auxiliary prediction 623 may be fed into a sigmoid function to obtain final prediction 629.
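A minimal sketch of the FIG. 6B combination, assuming PyTorch tensors in (batch, channel, z, y, x) order; the variable names are illustrative and not taken from the disclosure.

```python
import torch
import torch.nn as nn

aux = [torch.rand(1, 1, 24, 128, 128) for _ in range(4)]  # auxiliaries 529, 539, 549, 559

concat_621 = torch.cat(aux, dim=1)            # step 620: depth of four
fuse = nn.Conv3d(4, 1, kernel_size=1)         # 1x1x1 convolutional layer
pred_623 = fuse(concat_621)                   # depth of one
final_629 = torch.sigmoid(pred_623)           # final prediction 629
print(final_629.shape)                        # torch.Size([1, 1, 24, 128, 128])
```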

The final prediction 629 may be used to further determine the mask for the tissue or lesion. During the training process of input images for the FCN model in FIG. 6B, the output segmentation mask generated from the prediction 629 may be compared with the ground truth mask of the input images. A loss function may be determined. In one implementation, the loss function may include a softmax cross-entropy loss. In another implementation, the loss function may include a dice coefficient (DC) loss function.

The implementation of FIG. 6B may be further augmented as shown in FIG. 6C, where an auto-context strategy may be further used in the FCN model. Specifically, the input 2D/3D images 501, in addition to the auxiliary predictions 529, 539, 549, and 559, may be concatenated together in step 620 to generate a combined auxiliary prediction 621. The concatenated auxiliary prediction 621 may have a size of 128×128×24 and a depth of five (compared with the depth of four for the implementation of FIG. 6B). The concatenated auxiliary prediction 621 may be fed into a convolutional layer with a kernel size of 1×1×1 to generate auxiliary prediction 623. The auxiliary prediction 623 may have a size of 128×128×24 and a depth of one. The auxiliary prediction 623 may be fed into a sigmoid function to obtain final prediction 629. The final prediction 629 may be used to further determine the mask for the tissue or lesion.

Another implementation for combining auxiliary predictions 529, 539, 549, and 559 is shown in FIG. 6D, where an element-by-element summation rather than concatenation is used. Specifically, the auxiliary predictions 529, 539, 549, and 559 may be added together in step 640 element-by-element to generate a combined auxiliary prediction 641. The combined auxiliary prediction 641 may have a size of 128×128×24 and a depth of one. The combined auxiliary prediction 641 may be fed into a sigmoid function 642 to obtain final prediction 649. The final prediction 649 may be used to further determine the mask for the tissue or lesion.

The implementation of FIG. 6D may be further augmented as shown in FIG. 6E. Specifically, the input 2D/3D images 501, in addition to the auxiliary predictions 529, 539, 549, and 559, may be added together in step 640 in an element-by-element manner to generate a combined auxiliary prediction 641. The combined auxiliary prediction 641 may have a size of 128×128×24 and a depth of one. The combined auxiliary prediction 641 may be fed into a sigmoid function 642 to obtain final prediction 649. The final prediction 649 may be used to further determine the mask for the tissue or lesion of the input image.

Fully Convolutional Network (FCN) with Coarse-To-Fine Architecture and Densely Connected Convolutional Module

Another implementation of the FCN based on FIG. 6A is shown in FIG. 6F, where the combination of the auxiliary prediction masks 529, 539, 549, and 559 may be further processed by a densely connected convolutional (DenseConv) network to extract auto-context features prior to obtaining the final prediction mask 669.

Particularly, in FIG. 6F, the auxiliary predictions 529, 539, 549, and 559 may be concatenated together in step 660 to generate a concatenated auxiliary prediction, also referred to as auto-context input 661. The concatenated auxiliary prediction 661 may have a size of 128×128×24 and a depth of four. The auto-context input 661 may be fed into a DenseConv module 662 to generate a prediction map 663. The prediction map 663 may have a depth of one or more. The prediction map 663 may be fed into a convolutional layer 664 to generate a prediction map 665. The prediction map 665 may have a size of 128×128×24 and a depth of one.

The prediction map 665 may optionally be added together with the feature maps 555 in an element-by-element manner to generate a prediction map 667. The prediction map 667 may have a size of 128×128×24 and a depth of one. The prediction map 667 may be fed into a sigmoid function 668 to obtain final prediction 669.
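The stage-II head of FIG. 6F can be sketched as below; dense_conv here is a single-layer stand-in for the DenseConv module 662 (sketched further below) so that the example runs on its own, and all variable names are illustrative assumptions rather than identifiers from the disclosure.

```python
import torch
import torch.nn as nn

dense_conv = nn.Sequential(                     # placeholder for DenseConv module 662
    nn.Conv3d(4, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)
conv_664 = nn.Conv3d(16, 1, kernel_size=1)      # convolutional layer 664

auto_context_661 = torch.rand(1, 4, 24, 128, 128)   # concatenated auxiliaries, depth of four
feature_maps_555 = torch.rand(1, 1, 24, 128, 128)   # stage-I maps before the sigmoid

pred_663 = dense_conv(auto_context_661)
pred_665 = conv_664(pred_663)                   # depth of one
pred_667 = pred_665 + feature_maps_555          # optional element-by-element addition
final_669 = torch.sigmoid(pred_667)             # final prediction 669
```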

The final prediction 669 may be used to further determine the mask for the tissue or lesion. During the training process of input images for the FCN model in FIG. 6F, the output segmentation mask generated from the prediction 669 may be compared with the ground truth mask of the input images. A loss function may be determined. In one implementation, the loss function may include a softmax cross-entropy loss. In another implementation, the loss function may include a dice coefficient (DC) loss function.

The implementation of FIG. 6F may be further augmented as shown in FIG. 6G. In FIG. 6G, the input 2D/3D images 501, in addition to the auxiliary predictions 529, 539, 549, and 559, may be concatenated together in step 660 to generate the auto-context input 661. The concatenated auxiliary input 661 may have a size of 128×128×24 and a depth of five rather than the depth of four in FIG. 6F. The auto-context input 661 may be fed into a DenseConv module 662 to generate a prediction map 663. The prediction map 663 may have a depth of one or more. The prediction map 663 may be fed into a convolutional layer 664 to generate a prediction map 665. The prediction map 665 may have a size of 128×128×24 and a depth of one.

Similar to FIG. 6F, the prediction map 665 in FIG. 6G may be added together with the feature maps 555 in an element-by-element manner to generate a prediction map 667. The prediction map 667 may have a size of 128×128×24 and a depth of one. The prediction map 667 may be fed into a sigmoid function 668 to obtain final prediction 669. The final prediction 669 may be used to further determine the mask for the tissue or lesion. During the training process of input images for the FCN model in FIG. 6G, the output segmentation mask generated from the prediction 669 may be compared with the ground truth mask of the input images. A loss function may be determined. In one implementation, the loss function may include a softmax cross-entropy loss. In another implementation, the loss function may include a dice coefficient (DC) loss function.

The DenseConv module 662 above in FIG. 6F and FIG. 6G may include one or more convolutional layers. One embodiment is shown in FIG. 7 for a DenseConv module 700 including six convolutional layers 710, 720, 730, 740, 750, and 760. Each of the six convolutional layers may include a convolutional layer with an exemplary kernel size of 3×3×3 and 16 feature maps. Each of the six convolutional layers may further include a batch normalization (BN) layer and a ReLU layer.

An auto-context input 701 may be fed into the convolutional layer 710 to generate feature maps 713. In some exemplary implementations, the auto-context input 701 may have a size of 128×128×24 and a depth of four. The feature maps 713 may have a size of 128×128×24 and a depth of 16.

For a current convolutional layer, the input of the current convolutional layer may be concatenated with the output of the current convolutional layer to generate the input for the next convolutional layer.

For example, as shown in FIG. 7, the auto-context input 701 may be concatenated with the output 713 of the convolutional layer 710 to generate the feature maps 715. The feature maps 715 may be the input for the convolutional layer 720.

As shown in FIG. 7, the feature maps 715 may be fed into the convolutional layer 720 to generate the feature maps 723. The feature maps 723 may be concatenated with the feature maps 715 to generate the feature maps 725. The feature maps 725 may be the input for the convolutional layer 730.

As further shown in FIG. 7, the feature maps 725 may be fed into the convolutional layer 730 to generate the feature maps 733. The feature maps 733 may be concatenated with the feature maps 725 to generate the feature maps 735. The feature maps 735 may be the input for the convolutional layer 740. The feature maps 735 may be fed into the convolutional layer 740 to generate the feature maps 743. The feature maps 743 may be concatenated with the feature maps 735 to generate the feature maps 745. The feature maps 745 may be the input for the convolutional layer 750. Furthermore, the feature maps 745 may be fed into the convolutional layer 750 to generate the feature maps 753. The feature maps 753 may be concatenated with the feature maps 745 to generate the feature maps 755. The feature maps 755 may be the input for the convolutional layer 760.

As shown in FIG. 7, the feature maps 755 may be fed into the convolutional layer 760 to generate the feature maps which serve as the output 791 of the DenseConv module 700.

The six-convolutional-layer DenseConv implementation above is merely exemplary. In other embodiments, the DenseConv module 700 may include any number of convolutional layers, and each of the convolutional layers may use any kernel size and/or include any number of features.
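The six-layer example of FIG. 7 can be sketched as a densely connected module in PyTorch (an illustrative reconstruction, not the disclosed code); the class name and the small demo tensor are assumptions, while the 3×3×3 kernels, 16 feature maps per layer, and the concatenation pattern follow the description above.

```python
import torch
import torch.nn as nn

class DenseConvModule(nn.Module):
    """Each 3x3x3 Conv-BN-ReLU layer produces 16 feature maps; every layer's input is
    the concatenation of the module input with the outputs of all preceding layers,
    and the last layer's output is the module output (791)."""
    def __init__(self, in_channels: int = 4, growth: int = 16, num_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv3d(channels, growth, kernel_size=3, padding=1),
                nn.BatchNorm3d(growth),
                nn.ReLU(inplace=True),
            ))
            channels += growth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = x                                        # auto-context input (701)
        for i, layer in enumerate(self.layers):
            out = layer(features)                           # e.g., 713, 723, 733, ...
            if i == len(self.layers) - 1:
                return out                                  # module output (791)
            features = torch.cat([features, out], dim=1)    # e.g., 715, 725, 735, ...

# Small demo tensor; the full-size auto-context input described above is 128x128x24, depth 4.
demo = torch.randn(1, 4, 6, 32, 32)
print(DenseConvModule()(demo).shape)                        # torch.Size([1, 16, 6, 32, 32])
```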

Training

During the training process, the output segmentation mask generated in forward-propagation from an FCN model may be compared with the ground truth mask of the input images. A loss function may be determined. In one implementation, the loss function may include a softmax cross-entropy loss. In another implementation, the loss function may include a dice coefficient (DC) loss function. A back-propagation through the FCN model may then be performed based on, e.g., stochastic gradient descent, aimed at minimizing the loss function. By iterating the forward-propagation and back-propagation for the same input images, and for the entire training image set, the training parameters may converge to provide acceptable errors between the predicted masks and ground truth masks for all or most of the input images. The converged training parameters, including but not limited to the convolutional features/kernels and various weights and biases, may form a final predictive model that may be further verified using test images and used to predict segmentation or lesion masks for images that the network has never seen before. The FCN model is preferably trained to promote errors on the over-inclusive side to reduce or prevent false negatives in a later stage of computer-aided diagnosis (CAD) based on a predicted mask.
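One training iteration as described can be sketched as follows, assuming a PyTorch model fcn that maps image batches to voxel-wise probabilities and a data loader yielding image/mask pairs; all names are illustrative, and the Dice objective sketched earlier (or a softmax cross-entropy loss) can be passed as loss_fn.

```python
import torch
import torch.nn as nn

def train_one_epoch(fcn: nn.Module, loader, optimizer, loss_fn) -> None:
    fcn.train()
    for images, ground_truth_masks in loader:
        optimizer.zero_grad()
        predicted_masks = fcn(images)                        # forward-propagation
        loss = loss_fn(predicted_masks, ground_truth_masks)  # end loss
        loss.backward()                                      # back-propagation
        optimizer.step()                                     # gradient-descent update

# Example optimizer (stochastic gradient descent), assuming `fcn` is already built:
# optimizer = torch.optim.SGD(fcn.parameters(), lr=1e-3, momentum=0.9)
```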

For complex models with multiple stages, such as the ones shown in FIGS. 6F and 6G including model Stage 1 and Stage 2, the training process may include three steps as shown in FIG. 8.

Step 810 may include training Stage 1 (of the FCN model, e.g., of FIGS. 6F and 6G) for generating prediction masks based on one or more auxiliary outputs in Stage 1, by comparing the prediction masks with the ground truth masks. The one or more auxiliary outputs in Stage 1 may include one or more of Auxiliary 1-4 (529, 539, 549, and 559).

Step 820 may include fixing the training parameters in Stage 1 and training Stage 2 (of the FCN model, e.g., of FIGS. 6F and 6G) by generating prediction masks based on the output of Stage 2 and comparing the prediction masks with the ground truth masks. Stage 2 may include a DenseConv module, and the output of Stage 2 may include the prediction 669.

Step 830 may include fine tuning and training Stage 1 and Stage 2 jointly by using the model parameters obtained in steps 810 and 820 as initial values and further performing forward and back-propagation training processes based on the output of Stage 2. Stage 2 may include a DenseConv module, and the output of Stage 2 may include the prediction 669.
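A sketch of this three-step schedule is shown below, under the assumption that the model exposes stage1 and stage2 sub-modules; the module names, the stub model, and the learning rates are illustrative only and not taken from the disclosure.

```python
import torch
import torch.nn as nn

class TwoStageFCN(nn.Module):
    """Illustrative stand-in: stage1 for the coarse-to-fine FCN of FIG. 6F/6G,
    stage2 for the DenseConv auto-context stage."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv3d(1, 1, kernel_size=3, padding=1)
        self.stage2 = nn.Conv3d(1, 1, kernel_size=1)

model = TwoStageFCN()

# Step 810: train Stage 1 against its auxiliary outputs.
opt_810 = torch.optim.SGD(model.stage1.parameters(), lr=1e-3)

# Step 820: fix Stage 1 parameters and train Stage 2 on prediction 669.
for p in model.stage1.parameters():
    p.requires_grad = False
opt_820 = torch.optim.SGD(model.stage2.parameters(), lr=1e-3)

# Step 830: unfreeze and fine-tune both stages jointly from the learned weights.
for p in model.stage1.parameters():
    p.requires_grad = True
opt_830 = torch.optim.SGD(model.parameters(), lr=1e-4)
```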

System with Different FCN Models

Because different FCN models trained under different conditions may perform well for different input images and under various circumstances, a segmentation/lesion detection model may include multiple FCN sub-models to improve prediction accuracy. FIG. 9 shows one embodiment of a system 900 including two different FCN models: a first FCN 920 and a second FCN 930.

The first FCN 920 and the second FCN 930 may be different in terms of their architecture. For example, the first FCN 920 may be an FCN model similar to the model in FIG. 3, and the second FCN 930 may be an FCN model similar to the model in FIG. 6F.

The first FCN 920 and the second FCN 930 may also be different in terms of how the multiple channels are processed. For example, the first FCN 920 may be an FCN model processing each channel individually and independently (e.g., FCN(2D/3D, SC), as described above), and the second FCN 930 may be an FCN model processing all multiple channels together in an aggregated manner (e.g., FCN(2D/3D, MC), as described above).

For input images 910 of the system 900 in FIG. 9, a first prediction 921 may be generated from the first FCN 920 and a second prediction 931 may be generated from the second FCN 930. The first prediction 921 and the second prediction 931 may be fed into a parameterized comparator 950 to determine which of the predictions 921 and 931 to select, so as to generate a final prediction 990. The parameters of the comparator may be trained.

The selection by the comparator 950 may be performed at an individual pixel/voxel level, i.e., the comparator 950 may select, for each individual pixel/voxel, which probability for the corresponding pixel/voxel, out of the first prediction 921 and the second prediction 931, to use as the final prediction for the corresponding pixel/voxel.
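One way to realize such a trainable per-voxel comparator is sketched below; the 1×1×1 gating form and the soft mixing are illustrative choices for this sketch, not details mandated by the disclosure.

```python
import torch
import torch.nn as nn

class VoxelwiseComparator(nn.Module):
    """Looks at both predictions for each voxel and mixes between them with a
    trainable gate; thresholding the gate at 0.5 during inference gives a hard
    per-voxel selection between the two predictions."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Conv3d(2, 1, kernel_size=1)  # trainable comparator parameters

    def forward(self, pred_a: torch.Tensor, pred_b: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.gate(torch.cat([pred_a, pred_b], dim=1)))
        return alpha * pred_a + (1.0 - alpha) * pred_b

pred_921 = torch.rand(1, 1, 24, 128, 128)   # prediction from the first FCN 920
pred_931 = torch.rand(1, 1, 24, 128, 128)   # prediction from the second FCN 930
print(VoxelwiseComparator()(pred_921, pred_931).shape)
```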

Hardware Implementations

The FCN image segmentation and/or lesion detection above may be implemented as a computer platform 1000 shown in FIG. 10. The computer platform 1000 may include one or more training servers 1004 and 1006, one or more prediction engines 1008 and 1010, one or more databases 1012, one or more model repositories 1002, and user devices 1014 and 1016 associated with users 1022 and 1024. These components of the computer platform 1000 are inter-connected and in communication with one another via public or private communication networks 1001.

The training servers and prediction engines 1004, 1006, 1008, and 1010 may be implemented as a central server or a plurality of servers distributed in the communication networks. The training servers 1004 and 1006 may be responsible for training the FCN segmentation model discussed above. The prediction engines 1008 and 1010 may be responsible for analyzing an input image using the FCN segmentation model to generate a segmentation mask for the input image. While the various servers are shown in FIG. 10 as separate servers, they may alternatively be combined in a single server or a single group of distributed servers combining the functionality of training and prediction. The user devices 1014 and 1016 may be of any form of mobile or fixed electronic devices including but not limited to desktop personal computers, laptop computers, tablets, mobile phones, personal digital assistants, and the like. The user devices 1014 and 1016 may be installed with a user interface for accessing the digital platform.

The one or more databases 1012 of FIG. 10 may be hosted in a central database server or a plurality of distributed database servers. For example, the one or more databases 1012 may be implemented as being hosted virtually in a cloud by a cloud service provider. The one or more databases 1012 may organize data in any form, including but not limited to a relational database containing data tables, a graphic database containing nodes and relationships, and the like. The one or more databases 1012 may be configured to store, for example, images and their labeled masks collected from various sources. These images and labels may be used as a training data corpus for the training server 1006 for generating FCN segmentation models.

The one or more model repositories 1002 may be used to store, for example, the segmentation model with its trained parameters. In some implementations, the model repository 1002 may be integrated as part of the prediction engines 1008 and 1010.

FIG. 11 shows an exemplary computer system 1100 for implementing any of the computing components of FIGS. 1-10. The computer system 1100 may include communication interfaces 1102, system circuitry 1104, input/output (I/O) interfaces 1106, storage 1109, and display circuitry 1108 that generates machine interfaces 1110 locally or for remote display, e.g., in a web browser running on a local or remote machine. The machine interfaces 1110 and the I/O interfaces 1106 may include GUIs, touch sensitive displays, voice or facial recognition inputs, buttons, switches, speakers and other user interface elements. Additional examples of the I/O interfaces 1106 include microphones, video and still image cameras, headset and microphone input/output jacks, Universal Serial Bus (USB) connectors, memory card slots, and other types of inputs. The I/O interfaces 1106 may further include magnetic or optical media interfaces (e.g., a CDROM or DVD drive), serial and parallel bus interfaces, and keyboard and mouse interfaces.

The communication interfaces 1102 may include wireless transmitters and receivers (“transceivers”) 1112 and any antennas 1114 used by the transmitting and receiving circuitry of the transceivers 1112. The transceivers 1112 and antennas 1114 may support Wi-Fi network communications, for instance, under any version of IEEE 802.11, e.g., 802.11n or 802.11ac. The communication interfaces 1102 may also include wireline transceivers 1116. The wireline transceivers 1116 may provide physical layer interfaces for any of a wide range of communication protocols, such as any type of Ethernet, data over cable service interface specification (DOCSIS), digital subscriber line (DSL), Synchronous Optical Network (SONET), or other protocol.

The storage 1109 may be used to store various initial, intermediate, or final data or models needed for the implementation of the computer platform 1000. The storage 1109 may be separate from or integrated with the one or more databases 1012 of FIG. 10. The storage 1109 may be centralized or distributed, and may be local or remote to the computer system 1100. For example, the storage 1109 may be hosted remotely by a cloud computing service provider.

The system circuitry 1104 may include hardware, software, firmware, or other circuitry in any combination. The system circuitry 1104 may be implemented, for example, with one or more systems on a chip (SoC), application specific integrated circuits (ASIC), microprocessors, discrete analog and digital circuits, and other circuitry. The system circuitry 1104 is part of the implementation of any desired functionality related to the reconfigurable computer platform 1000. As just one example, the system circuitry 1104 may include one or more instruction processors 1118 and memories 1120. The memories 1120 store, for example, control instructions 1126 and an operating system 1124. In one implementation, the instruction processors 1118 execute the control instructions 1126 and the operating system 1124 to carry out any desired functionality related to the reconfigurable computer platform 1000.

The methods, devices, processing, and logic described above may be implemented in many different ways and in many different combinations of hardware and software. For example, all or parts of the implementations may be circuitry that includes an instruction processor, such as a Central Processing Unit (CPU), microcontroller, or a microprocessor; an Application Specific Integrated Circuit (ASIC), Programmable Logic Device (PLD), or Field Programmable Gate Array (FPGA); or circuitry that includes discrete logic or other circuit components, including analog circuit components, digital circuit components or both; or any combination thereof. The circuitry may include discrete interconnected hardware components and/or may be combined on a single integrated circuit die, distributed among multiple integrated circuit dies, or implemented in a Multiple Chip Module (MCM) of multiple integrated circuit dies in a common package, as examples.

The circuitry may further include or access instructions for execution by the circuitry. The instructions may be stored in a tangible storage medium that is other than a transitory signal, such as a flash memory, a Random Access Memory (RAM), a Read Only Memory (ROM), or an Erasable Programmable Read Only Memory (EPROM); or on a magnetic or optical disc, such as a Compact Disc Read Only Memory (CDROM), Hard Disk Drive (HDD), or other magnetic or optical disk; or in or on another machine-readable medium. A product, such as a computer program product, may include a storage medium and instructions stored in or on the medium, and the instructions, when executed by the circuitry in a device, may cause the device to implement any of the processing described above or illustrated in the drawings.

The implementations may be distributed as circuitry among multiple system components, such as among multiple processors and memories, optionally including multiple distributed processing systems. Parameters, databases, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be logically and physically organized in many different ways, and may be implemented in many different ways, including as data structures such as linked lists, hash tables, arrays, records, objects, or implicit storage mechanisms. Programs may be parts (e.g., subroutines) of a single program, separate programs, distributed across several memories and processors, or implemented in many different ways, such as in a library, such as a shared library (e.g., a Dynamic Link Library (DLL)). The DLL, for example, may store instructions that perform any of the processing described above or illustrated in the drawings, when executed by the circuitry.

While the particular invention has been described with reference to illustrative embodiments, this description is not meant to be limiting. Various modifications of the illustrative embodiments and additional embodiments of the invention will be apparent to one of ordinary skill in the art from this description. Those skilled in the art will readily recognize that these and various other modifications can be made to the exemplary embodiments, illustrated and described herein, without departing from the spirit and scope of the present invention. It is therefore contemplated that the appended claims will cover any such modifications and alternate embodiments. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.

What is claimed is:
 1. A system for performing segmentation of digitalimages, comprising: a communication interface circuitry; a database; apredictive model repository; and a processing circuitry in communicationwith the database and the predictive model repository, the processingcircuitry configured to: receive a set of training images labeled with acorresponding set of ground truth segmentation masks; establish a fullyconvolutional neural network comprising a multi-layer contractionconvolutional neural network and an expansion convolutional neuralnetwork connected in tandem; and iteratively train the full convolutionneural network in an end-to-end manner using the set of training imagesand the corresponding set of ground truth segmentation masks byconfiguring the processing circuitry to: down-sample a training image ofthe set of training images through the multi-layer contractionconvolutional neural network to generate an intermediate feature map,wherein a resolution of the intermediate feature map is lower than aresolution of the training image, up-sample the intermediate feature mapthrough the multi-layer expansion convolutional neural network togenerate a first feature map, generate, based on the training image andthe first feature map, a predictive segmentation mask for the trainingimage, generate an end loss based on a difference between the predictivesegmentation mask and a ground truth segmentation mask corresponding tothe training image, back-propagate the end loss through the fullconvolutional neural network, and minimize the end loss by adjusting aset of training parameters of the fully convolutional neural networkusing gradient descent.
2. The system of claim 1, wherein the processing circuitry is further configured to: store the iteratively trained fully convolutional neural network with the set of training parameters in the predictive model repository; receive an input image, wherein the input image comprises one of a test image or an unlabeled image; and forward-propagate the input image through the iteratively trained fully convolutional neural network with the set of training parameters to generate an output segmentation mask.
 3. The system of claim 1, whereinwhen the processing circuitry is configured to generate, based on thetraining image and the first feature map, the predictive segmentationmask for the training image, the processing circuitry is configured to:implement a first auxiliary convolutional layer on the first feature mapto generate a convoluted first feature map, the convoluted first featuremap having a same resolution as the first feature map, the convolutedfirst feature map having a depth of one; when the convoluted firstfeature map has a different resolution as the training image, adjust aresolution of the convoluted first feature map to have the sameresolution as the training image; and generate the predictivesegmentation mask for the training image, based on the training imageand the resolution-adjusted convoluted first feature map.
 4. The systemof claim 3, wherein when the processing circuitry is configured togenerate the predictive segmentation mask for the training image, basedon the training image and the resolution-adjusted convoluted firstfeature map, the processing circuitry is configured to: perform a firstsigmoid function on the resolution-adjusted convoluted first feature mapto generate a first auxiliary prediction map; and generate thepredictive segmentation mask for the training image, based on thetraining image and the first auxiliary prediction map.
 5. The system ofclaim 4, wherein when the processing circuitry is configured to generatethe predictive segmentation mask for the training image, based on thetraining image and the first auxiliary prediction map, the processingcircuitry is configured to: add the training image and the firstauxiliary prediction map to generate an auto-context prediction map, orconcatenate the training image and the first auxiliary prediction map togenerate the auto-context prediction map; generate the predictivesegmentation mask for the training image, based on the auto-contextprediction map.
 6. The system of claim 5, wherein when the processingcircuitry is configured to generate the predictive segmentation mask forthe training image, based on the auto-context prediction map, theprocessing circuitry is configured to: perform a densely connectedconvolutional (DenseConv) operation on the auto-context prediction mapto generate a DenseConv prediction map, the DenseConv operationincluding one or more convolutional layers; perform a DenseConvauxiliary convolutional layer on the DenseConv prediction map togenerate a convoluted DenseConv prediction map; add the convolutedDenseConv prediction map and the added second feature map to generatethe added DenseConv prediction map; and generate, based on the addedDenseConv prediction map, the predictive segmentation mask for thetraining image.
 7. The system of claim 1, wherein the processingcircuitry is further configured to iteratively train the fullconvolution neural network in the end-to-end manner using the set oftraining images and the corresponding set of ground truth segmentationmasks by configuring the processing circuitry to: up-sample theintermediate feature map through the multi-layer expansion convolutionalneural network to generate a second feature map, wherein a resolution ofthe second feature map is larger than the resolution of the firstfeature map; implement a first auxiliary convolutional layer on thefirst feature map to generate a convoluted first feature map, theconvoluted first feature map having a same resolution as the firstfeature map, the convoluted first feature map having a depth of one;implement a second auxiliary convolutional layer on the second featuremap to generate a convoluted second feature map, the convoluted secondfeature map having a same resolution as the second feature map, theconvoluted second feature map having a depth of one; implement a firstde-convolutional layer on the convoluted first feature map to generate ade-convoluted first feature map, the de-convoluted first feature maphaving a larger resolution than the first feature map; add thede-convoluted first feature map and the convoluted second feature map togenerate an added second feature map; perform a second sigmoid functionon the added second feature map to generate a second auxiliaryprediction map; and generate the predictive segmentation mask for thetraining image, based on the training image and the second auxiliaryprediction map.
 8. The system of claim 7, wherein when the processingcircuitry is configured to generate the predictive segmentation mask forthe training image, based on the training image and the second auxiliaryprediction map, the processing circuitry is configured to: add thetraining image and the second auxiliary prediction map to generate anauto-context prediction map, or concatenate the training image and thesecond auxiliary prediction map to generate an auto-context predictionmap; and generate the predictive segmentation mask for the trainingimage, based on the auto-context prediction map.
 9. A method for imagesegmentation, comprising: receiving, by a computer comprising a memorystoring instructions and a processor in communication with the memory, aset of training images labeled with a corresponding set of ground truthsegmentation masks; establishing, by the computer, a fully convolutionalneural network comprising a multi-layer contraction convolutional neuralnetwork and an expansion convolutional neural network connected intandem; and iteratively training, by the computer, the full convolutionneural network in an end-to-end manner using the set of training imagesand the corresponding set of ground truth segmentation masks by:down-sampling a training image of the set of training images through themulti-layer contraction convolutional neural network to generate anintermediate feature map, wherein a resolution of the intermediatefeature map is lower than a resolution of the training image;up-sampling the intermediate feature map through the multi-layerexpansion convolutional neural network to generate a first feature map;generating, based on the training image and the first feature map, apredictive segmentation mask for the training image; generating an endloss based on a difference between the predictive segmentation mask anda ground truth segmentation mask corresponding to the training image;back-propagating the end loss through the full convolutional neuralnetwork; and minimizing the end loss by adjusting a set of trainingparameters of the fully convolutional neural network using gradientdescent.
10. The method of claim 9, further comprising: storing, by the computer, the iteratively trained fully convolutional neural network with the set of training parameters in a predictive model repository; receiving, by the computer, an input image, wherein the input image comprises one of a test image or an unlabeled image; and forward-propagating, by the computer, the input image through the iteratively trained fully convolutional neural network with the set of training parameters to generate an output segmentation mask.
11. The method of claim 9, wherein the generating, based on the training image and the first feature map, the predictive segmentation mask for the training image comprises: implementing, by the computer, a first auxiliary convolutional layer on the first feature map to generate a convoluted first feature map, the convoluted first feature map having a same resolution as the first feature map, the convoluted first feature map having a depth of one; when the convoluted first feature map has a different resolution than the training image, adjusting, by the computer, a resolution of the convoluted first feature map to have the same resolution as the training image; and generating, by the computer, the predictive segmentation mask for the training image, based on the training image and the resolution-adjusted convoluted first feature map.
12. The method of claim 11, wherein the generating the predictive segmentation mask for the training image, based on the training image and the resolution-adjusted convoluted first feature map, comprises: performing a first sigmoid function on the resolution-adjusted convoluted first feature map to generate a first auxiliary prediction map; and generating the predictive segmentation mask for the training image, based on the training image and the first auxiliary prediction map.
 13. The method of claim 12, wherein the generating the predictivesegmentation mask for the training image, based on the training imageand the first auxiliary prediction map comprises: adding the trainingimage and the first auxiliary prediction map to generate an auto-contextprediction map, or concatenating the training image and the firstauxiliary prediction map to generate the auto-context prediction map;generating the predictive segmentation mask for the training image,based on the auto-context prediction map.
 14. The method of claim 13,wherein the generating the predictive segmentation mask for the trainingimage, based on the auto-context prediction map, comprises: performing adensely connected convolutional (DenseConv) operation on theauto-context prediction map to generate a DenseConv prediction map, theDenseConv operation including one or more convolutional layers;performing a DenseConv auxiliary convolutional layer on the DenseConvprediction map to generate a convoluted DenseConv prediction map; addingthe convoluted DenseConv prediction map and the added second feature mapto generate the added DenseConv prediction map; and generating, based onthe added DenseConv prediction map, the predictive segmentation mask forthe training image.
 15. The method of claim 9, wherein the iterativelytraining the full convolution neural network in the end-to-end mannerusing the set of training images and the corresponding set of groundtruth segmentation masks further comprises: up-sampling the intermediatefeature map through the multi-layer expansion convolutional neuralnetwork to generate a second feature map, wherein a resolution of thesecond feature map is larger than the resolution of the first featuremap; implementing a first auxiliary convolutional layer on the firstfeature map to generate a convoluted first feature map, the convolutedfirst feature map having a same resolution as the first feature map, theconvoluted first feature map having a depth of one; implementing asecond auxiliary convolutional layer on the second feature map togenerate a convoluted second feature map, the convoluted second featuremap having a same resolution as the second feature map, the convolutedsecond feature map having a depth of one; implementing a firstde-convolutional layer on the convoluted first feature map to generate ade-convoluted first feature map, the de-convoluted first feature maphaving a larger resolution than the first feature map; adding thede-convoluted first feature map and the convoluted second feature map togenerate an added second feature map; performing a second sigmoidfunction on the added second feature map to generate a second auxiliaryprediction map; and generating the predictive segmentation mask for thetraining image, based on the training image and the second auxiliaryprediction map.
 16. The method of claim 15, wherein the generating thepredictive segmentation mask for the training image, based on thetraining image and the second auxiliary prediction map comprises: addingthe training image and the second auxiliary prediction map to generatean auto-context prediction map, or concatenating the training image andthe second auxiliary prediction map to generate an auto-contextprediction map; and generating the predictive segmentation mask for thetraining image, based on the auto-context prediction map.
 17. A systemfor performing segmentation of digital images, comprising: acommunication interface circuitry; a database; a predictive modelrepository; and a processing circuitry in communication with thedatabase and the predictive model repository, the processing circuitryconfigured to: receive a set of training images labeled with acorresponding set of ground truth segmentation masks; establish asegmentation network comprising a first fully convolutional neuralnetwork, a second fully convolutional neural network, and an evaluationnetwork, wherein each of the first and second fully convolutional neuralnetworks comprises a multi-layer contraction convolutional neuralnetwork and an expansion convolutional neural network connected intandem, and the evaluation network is in communication with the firstand second fully convolutional neural networks; and iteratively trainthe segmentation network in an end-to-end manner using the set oftraining images and the corresponding set of ground truth segmentationmasks by configuring the processing circuitry to: generate a firstpredictive segmentation mask for a training image of the set of trainingimages, by the first fully convolutional neural network based on thetraining image, generate a second predictive segmentation mask for thetraining image, by the second fully convolutional neural network basedon the training image, generate a final predictive segmentation mask forthe training image, by the evaluation network based on the firstpredictive segmentation mask and the second predictive segmentationmask, generate an end loss based on a difference between the finalpredictive segmentation mask and a ground truth segmentation maskcorresponding to the training image, back-propagate the end loss throughthe segmentation network, and minimize the end loss by adjusting a setof training parameters of the segmentation network using gradientdescent.
18. The system of claim 17, wherein the processing circuitry is further configured to: store the iteratively trained segmentation network with the set of training parameters in the predictive model repository; receive an input image, wherein the input image comprises one of a test image or an unlabeled image; and forward-propagate the input image through the iteratively trained segmentation network with the set of training parameters to generate an output segmentation mask.
19. The system of claim 17, wherein when the processing circuitry is configured to generate the final predictive segmentation mask for the training image, by the evaluation network based on the first predictive segmentation mask and the second predictive segmentation mask, the processing circuitry is configured to: generate a final predictive value for each pixel of the final predictive segmentation mask, by the evaluation network based on values of corresponding pixels of the first predictive segmentation mask and the second predictive segmentation mask.
 20. The system of claim 17, wherein when the processing circuitryis configured to generate the final predictive segmentation mask for thetraining image, by the evaluation network based on the first predictivesegmentation mask and the second predictive segmentation mask, theprocessing circuitry is configured to: add, by the evaluation network,the first predictive segmentation mask and the second predictivesegmentation mask to generate the final predictive segmentation mask.